Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
Main Author: WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Human-AI Collaboration; Referring Expression Comprehension; Visual Grounding; Spatio-Temporal Video Grounding; Dynamic Model Optimizations; Multi-Modal Processing; Artificial Intelligence and Robotics
Online Access: https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
Institution: Singapore Management University
id: sg-smu-ink.etd_coll-1600
record_format: dspace
institution: Singapore Management University
building: SMU Libraries
continent: Asia
country: Singapore
content_provider: SMU Libraries
collection: InK@SMU
language: English
topic: Human-AI Collaboration; Referring Expression Comprehension; Visual Grounding; Spatio-Temporal Video Grounding; Dynamic Model Optimizations; Multi-Modal Processing; Artificial Intelligence and Robotics
description:
The rapid adoption of mixed reality, in tandem with advances in NLP and computer vision, has opened up unprecedented opportunities for more naturalistic interaction interfaces that underpin human-AI collaborative applications such as spatial computing and interactive conversational agents. One notable example is the emergence of interactive virtual assistants, which facilitate more natural communication of instructions and queries through modalities like voice and text. This trend is driving the development of innovative ubiquitous, mixed-reality computing applications. Such interactive, natural communication is also critical to support advances in human-robot interactive co-working across a variety of industrial, commercial and home environments. Conventional voice-based conversational agents, exemplified by technologies such as Apple’s Siri and Amazon’s Alexa, are evolving into increasingly multi-modal systems, which can now support the comprehension of human instructions through a combination of language, gestures, and visual inputs. The intelligence behind these conversational agents relies on sophisticated Deep Neural Network (DNN) architectures (e.g., Transformers), which underlie the recent emergence of Large Language Models (LLMs) and Vision Language Models (VLMs) and have dramatically enhanced the ability of AI software to comprehend a mix of visual and natural textual/verbal cues. While these models exhibit increasing accuracy, their computationally intensive nature and large model sizes pose challenges for low-latency, on-device execution of inference tasks, especially on resource-constrained wearable and Internet of Things (IoT) devices such as the Microsoft HoloLens or Nvidia Jetson platforms. Thus, my research centres on enabling the execution of these multi-modal human-interaction tasks on resource-constrained devices, with a specific focus on comprehending human visual grounding instructions. The goal is to achieve low-power, low-latency execution while maintaining comparable task accuracy, thereby preserving interactivity.
Natural human-human interaction is inherently multi-modal: we use a variety of modalities, including verbal commands, gestures, facial expressions, visual cues, gaze and even vocal nuances (e.g., tone and rhythm), to mutually convey our intent. Motivated by such human-human interaction scenarios, this thesis investigates methods to enable multi-modal sense-making for human-issued instructions or queries on resource-constrained wearable and edge devices. In particular, we consider object acquisition as an exemplary human-AI collaboration task that benefits from support for comprehending naturalistic multi-modal instructions. To address this, we leverage Referring Expression Comprehension (REC), or Visual Grounding, models developed in the computer vision and NLP literature. These models, when provided with an image along with verbal and/or gestural inputs, identify the bounding box of the referred object. We then introduce a number of sense-making models and optimization techniques to support low-latency inference with such models on pervasive devices.
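To make the task contract concrete, the following is a minimal sketch of the interface such a visual grounding model exposes: an image, a referring expression, and an optional pointing gesture go in, and a bounding box for the referred object comes out. The names used here (RECModel, ground, BoundingBox) are hypothetical placeholders for illustration, not the actual models developed in this thesis.

```python
# Minimal sketch (illustrative only) of the input/output contract of a
# Referring Expression Comprehension (REC) / visual grounding model.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BoundingBox:
    x_min: float  # normalized image coordinates in [0, 1]
    y_min: float
    x_max: float
    y_max: float
    confidence: float


class RECModel:
    """Hypothetical wrapper around a pretrained multi-modal grounding network."""

    def ground(self,
               image,                                          # H x W x 3 RGB array
               expression: str,                                # e.g. "the red mug next to the laptop"
               pointing: Optional[Tuple[float, float]] = None  # optional gesture cue
               ) -> BoundingBox:
        # A real implementation would fuse visual, language and gesture features
        # (e.g. with a transformer encoder) and regress a box; this stub only
        # documents the interface assumed in the discussion above.
        raise NotImplementedError
```

An application running on a headset or robot would invoke ground() once per instruction and act on the returned box (e.g., highlight or grasp the referred object).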
In this thesis, our emphasis is predominantly on exploring diverse dynamic optimizations for the comprehension of task instructions. Throughout these investigations, we rely on a common guiding principle: not all instructions pose the same level of task complexity. To illustrate, consider the varying complexities introduced by different types of instructions. In a cluttered environment, identifying a target object often necessitates a more intricate execution pipeline to ensure accurate identification; users may employ a combination of language instructions and pointing gestures, which helps the model disambiguate among closely situated objects, so the presence of multiple modalities alleviates task complexity. Conversely, in a less cluttered space, a simple pointing gesture may suffice for object identification, requiring a less complex execution pipeline. This nuanced understanding of task complexity serves as the foundation for the dynamic optimization techniques explored in subsequent chapters.
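As a simple illustration of this principle (and only an illustration; the gating policies developed in later chapters are considerably more refined), the sketch below routes an instruction through a cheap geometric resolver when a pointing gesture is available and the scene contains few candidate objects, and falls back to a full REC pipeline otherwise. The helper names, the clutter threshold, and the gating rule are all assumptions made for this example.

```python
# Illustrative sketch of complexity-aware routing between a lightweight and a
# heavyweight comprehension pipeline; not the thesis's actual gating design.
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


def resolve_by_pointing(candidates: List[Box],
                        pointing: Tuple[float, float]) -> Box:
    """Cheap path: pick the candidate box whose centre lies closest to the
    pointed-at location (both given in normalized image coordinates)."""
    px, py = pointing

    def dist(box: Box) -> float:
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        return (cx - px) ** 2 + (cy - py) ** 2

    return min(candidates, key=dist)


def full_rec_pipeline(image, expression: str, candidates: List[Box]) -> Box:
    """Expensive path: a full multi-modal REC model would be invoked here."""
    raise NotImplementedError  # placeholder for the heavyweight model


def comprehend_instruction(image,
                           expression: str,
                           candidates: List[Box],
                           pointing: Optional[Tuple[float, float]] = None,
                           clutter_threshold: int = 3) -> Box:
    # Gate on a crude notion of task complexity: if the user is pointing and
    # the scene has few candidate objects, the cheap geometric resolver is
    # usually sufficient; otherwise defer to the full pipeline.
    if pointing is not None and len(candidates) <= clutter_threshold:
        return resolve_by_pointing(candidates, pointing)
    return full_rec_pipeline(image, expression, candidates)
```

Calling comprehend_instruction with a pointing cue and a short candidate list exercises only the cheap path; all other inputs defer to the heavyweight model, which is the trade-off the dynamic optimizations in later chapters exploit.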
This dissertation is organized into two parts. Part I: Image-based Human Instruction Comprehension studies model optimizations applied to REC models, which process a single static image along with language and, optionally, gestural modalities. Part II: Video-based Human Instruction Comprehension extends these methodologies to more complex scenarios in which the vision input is a video rather than a single static image.
format: text
author: WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
title: Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
publisher: Institutional Knowledge at Singapore Management University
publishDate: 2024
url: https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
_version_: 1814047645506732032
spelling: sg-smu-ink.etd_coll-1600; 2024-07-17T08:11:42Z; 2024-05-01T07:00:00Z; text; application/pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Dissertations and Theses Collection (Open Access); eng; Institutional Knowledge at Singapore Management University