Enabling and optimizing multi-modal sense-making for human-AI interaction tasks


Bibliographic Details
Main Author: WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.etd_coll-1600
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Human-AI Collaboration
Referring Expression Comprehension
Visual Grounding
Spatio-Temporal Video Grounding
Dynamic Model Optimizations
Multi-Modal Processing
Artificial Intelligence and Robotics
spellingShingle Human-AI Collaboration
Referring Expression Comprehension
Visual Grounding
Spatio-Temporal Video Grounding
Dynamic Model Optimizations
Multi-Modal Processing
Artificial Intelligence and Robotics
WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
description The rapid adoption of mixed reality, together with advances in NLP and computer vision, has opened up unprecedented opportunities for more naturalistic interaction interfaces that underpin human-AI collaborative applications such as spatial computing and interactive conversational agents. One notable example is the emergence of interactive virtual assistants, which support more natural communication of instructions and queries through modalities such as voice and text, and which are driving the development of innovative ubiquitous, mixed-reality computing applications. Such interactive, natural communication is also critical for advances in human-robot co-working across a variety of industrial, commercial and home environments. Conventional voice-based conversational agents, exemplified by Apple’s Siri and Amazon’s Alexa, are evolving into increasingly multi-modal systems that can comprehend human instructions expressed through a combination of language, gestures and visual inputs. The intelligence behind these agents rests on sophisticated Deep Neural Network (DNN) architectures (e.g., Transformers), which underlie the recent emergence of Large Language Models (LLMs) and Vision Language Models (VLMs) and have dramatically enhanced the ability of AI software to comprehend a mix of visual and textual/verbal cues. While these models are increasingly accurate, their computational intensity and large model sizes make low-latency, on-device inference challenging, especially on resource-constrained wearable and Internet of Things (IoT) devices such as the Microsoft HoloLens or Nvidia Jetson platforms. My research is therefore centred on enabling the execution of such multi-modal human-interactive tasks, with a specific focus on comprehending human visual grounding instructions, on resource-constrained devices; the goal is low-power, low-latency execution that maintains comparable task accuracy and thereby preserves interactivity.
Natural human-human interaction is inherently multi-modal: we use verbal commands, gestures and facial expressions, visual cues, gaze, and even vocal nuances (e.g., tone and rhythm) to convey our intent to one another. Motivated by such human-human interaction scenarios, this thesis investigates methods for multi-modal sense-making of human-issued instructions and queries on resource-constrained wearable and edge devices. In particular, we consider object acquisition as an exemplary human-AI collaboration task that benefits from support for comprehending naturalistic multi-modal instructions. To address it, we leverage Referring Expression Comprehension (REC), or visual grounding, models developed in the computer vision and NLP literature: given an image together with verbal and/or gestural inputs, these models identify the bounding box of the referred object. We then introduce a number of sense-making models and optimization techniques that support low-latency inference with such models on pervasive devices. In this thesis, our emphasis is predominantly on exploring diverse dynamic optimizations for the comprehension of task instructions.
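For illustration only, a minimal sketch of the REC / visual-grounding interface described above, assuming a generic pre-trained model with a hypothetical predict() method; the class, method and argument names are editorial assumptions rather than the models actually used in the thesis:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class GroundingResult:
        box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels
        confidence: float                       # model's confidence in the predicted box

    class RECModel:
        """Hypothetical Referring Expression Comprehension (visual grounding) interface."""

        def predict(self, image, expression: str,
                    pointing_vector: Optional[Tuple[float, float]] = None) -> GroundingResult:
            # A real model would fuse visual features, the language embedding and,
            # optionally, the gesture cue, then regress the referred object's bounding box.
            raise NotImplementedError

    # Example usage (assumed names, for exposition only):
    # model = RECModel()
    # result = model.predict(frame, "the red mug next to the laptop",
    #                        pointing_vector=(0.62, 0.41))
    # print(result.box, result.confidence)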
Throughout these investigations, we rely on a common guiding principle: not all instructions pose the same level of task complexity. To illustrate, consider the varying complexity introduced by different types of instructions. In a cluttered environment, identifying the target object often requires a more intricate execution pipeline to ensure accurate identification; users may combine language instructions with pointing gestures, which help the model disambiguate among closely situated objects, so the availability of multiple modalities can itself reduce the effective task complexity. Conversely, in a less cluttered space a simple pointing gesture may suffice, permitting a much simpler execution pipeline. This nuanced view of task complexity is the foundation for the dynamic optimization techniques explored in subsequent chapters. The dissertation is organized into two parts. Part I: Image-based Human Instruction Comprehension studies model optimizations applied to REC models, which process a single static image together with language and, optionally, gestural modalities. Part II: Video-based Human Instruction Comprehension extends these methodologies to more complex scenarios in which video, rather than a single static image, serves as the vision input.
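For illustration only, a minimal sketch of the complexity-aware routing principle outlined above; the clutter heuristic, the threshold and the pipeline names are editorial assumptions, not the gating mechanisms developed in the thesis:

    from typing import Callable, List, Optional, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalised coordinates

    def estimate_clutter(candidates: List[Box],
                         pointing_vector: Optional[Tuple[float, float]]) -> int:
        """Toy clutter measure: number of candidate objects near the pointed-at location."""
        if pointing_vector is None:
            return len(candidates)
        px, py = pointing_vector
        near = [b for b in candidates
                if abs((b[0] + b[2]) / 2 - px) < 0.15 and abs((b[1] + b[3]) / 2 - py) < 0.15]
        return len(near)

    def ground_instruction(image, expression: str, candidates: List[Box],
                           pointing_vector: Optional[Tuple[float, float]],
                           light_pipeline: Callable, full_pipeline: Callable,
                           clutter_threshold: int = 1) -> Box:
        """Route the instruction to the cheapest pipeline expected to resolve it."""
        clutter = estimate_clutter(candidates, pointing_vector)
        if pointing_vector is not None and clutter <= clutter_threshold:
            # Uncluttered scene plus a gesture: a lightweight geometric resolver suffices.
            return light_pipeline(candidates, pointing_vector)
        # Cluttered scene or language-only instruction: run the full multi-modal REC model.
        return full_pipeline(image, expression, pointing_vector)

In practice, such a gate must itself be far cheaper than the pipelines it selects between; otherwise the routing overhead would erase the latency savings it is meant to deliver.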
format text
author WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
author_facet WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
author_sort WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
title Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_short Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_full Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_fullStr Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_full_unstemmed Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_sort enabling and optimizing multi-modal sense-making for human-ai interaction tasks
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
_version_ 1814047645506732032
spelling sg-smu-ink.etd_coll-1600 2024-07-17T08:11:42Z Enabling and optimizing multi-modal sense-making for human-AI interaction tasks WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon 2024-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/etd_coll/602 https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Dissertations and Theses Collection (Open Access) eng Institutional Knowledge at Singapore Management University Human-AI Collaboration Referring Expression Comprehension Visual Grounding Spatio-Temporal Video Grounding Dynamic Model Optimizations Multi-Modal Processing Artificial Intelligence and Robotics