Enabling and optimizing multi-modal sense-making for human-AI interaction tasks


Bibliographic Details
Main Author: WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.etd_coll-1600
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Human-AI Collaboration
Referring Expression Comprehension
Visual Grounding
Spatio-Temporal Video Grounding
Dynamic Model Optimizations
Multi-Modal Processing
Artificial Intelligence and Robotics
spellingShingle Human-AI Collaboration
Referring Expression Comprehension
Visual Grounding
Spatio-Temporal Video Grounding
Dynamic Model Optimizations
Multi-Modal Processing
Artificial Intelligence and Robotics
WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
description The rapid adoption of mixed reality, together with advances in NLP and computer vision, has opened up unprecedented opportunities for more naturalistic interaction interfaces that underpin human-AI collaborative applications such as spatial computing and interactive conversational agents. One notable example is the emergence of interactive virtual assistants, which support more natural communication of instructions and queries through modalities such as voice and text, and which are driving the development of innovative ubiquitous, mixed-reality computing applications. Such interactive, natural communication is also critical for advances in human-robot co-working across a variety of industrial, commercial and home environments. Conventional voice-based conversational agents, exemplified by Apple’s Siri and Amazon’s Alexa, are evolving into increasingly multi-modal systems that can comprehend human instructions expressed through a combination of language, gestures and visual inputs. The intelligence behind these agents rests on sophisticated Deep Neural Network (DNN) architectures (e.g., Transformers), which underlie the recent emergence of Large Language Models (LLMs) and Vision Language Models (VLMs) and have dramatically enhanced the ability of AI software to comprehend a mix of visual and textual/verbal cues. While these models are increasingly accurate, their computational intensity and large model sizes make low-latency, on-device inference challenging, especially on resource-constrained wearable and Internet of Things (IoT) devices such as the Microsoft HoloLens or Nvidia Jetson platforms. My research is therefore centred on enabling the execution of such multi-modal human-interactive tasks, with a specific focus on comprehending human visual grounding instructions, on resource-constrained devices; the goal is low-power, low-latency execution that maintains comparable task accuracy and thereby preserves interactivity.
Natural human-human interaction is inherently multi-modal: we use verbal commands, gestures and facial expressions, visual cues, gaze, and even vocal nuances (e.g., tone and rhythm) to convey our intent to one another. Motivated by such human-human interaction scenarios, this thesis investigates methods for multi-modal sense-making of human-issued instructions and queries on resource-constrained wearable and edge devices. In particular, we consider object acquisition as an exemplary human-AI collaboration task that benefits from support for comprehending naturalistic multi-modal instructions. To address it, we leverage Referring Expression Comprehension (REC), or visual grounding, models developed in the computer vision and NLP literature: given an image together with verbal and/or gestural inputs, these models identify the bounding box of the referred object. We then introduce a number of sense-making models and optimization techniques that support low-latency inference with such models on pervasive devices. In this thesis, our emphasis is predominantly on exploring diverse dynamic optimizations for the comprehension of task instructions.
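For illustration only, a minimal sketch of the REC / visual-grounding interface described above, assuming a generic pre-trained model with a hypothetical predict() method; the class, method and argument names are editorial assumptions rather than the models actually used in the thesis:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class GroundingResult:
        box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels
        confidence: float                       # model's confidence in the predicted box

    class RECModel:
        """Hypothetical Referring Expression Comprehension (visual grounding) interface."""

        def predict(self, image, expression: str,
                    pointing_vector: Optional[Tuple[float, float]] = None) -> GroundingResult:
            # A real model would fuse visual features, the language embedding and,
            # optionally, the gesture cue, then regress the referred object's bounding box.
            raise NotImplementedError

    # Example usage (assumed names, for exposition only):
    # model = RECModel()
    # result = model.predict(frame, "the red mug next to the laptop",
    #                        pointing_vector=(0.62, 0.41))
    # print(result.box, result.confidence)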
Throughout these investigations, we rely on a common guiding principle: not all instructions pose the same level of task complexity. To illustrate, consider the varying complexity introduced by different types of instructions. In a cluttered environment, identifying the target object often requires a more intricate execution pipeline to ensure accurate identification; users may combine language instructions with pointing gestures, which help the model disambiguate among closely situated objects, so the availability of multiple modalities can itself reduce the effective task complexity. Conversely, in a less cluttered space a simple pointing gesture may suffice, permitting a much simpler execution pipeline. This nuanced view of task complexity is the foundation for the dynamic optimization techniques explored in subsequent chapters. The dissertation is organized into two parts. Part I: Image-based Human Instruction Comprehension studies model optimizations applied to REC models, which process a single static image together with language and, optionally, gestural modalities. Part II: Video-based Human Instruction Comprehension extends these methodologies to more complex scenarios in which video, rather than a single static image, serves as the vision input.
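For illustration only, a minimal sketch of the complexity-aware routing principle outlined above; the clutter heuristic, the threshold and the pipeline names are editorial assumptions, not the gating mechanisms developed in the thesis:

    from typing import Callable, List, Optional, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalised coordinates

    def estimate_clutter(candidates: List[Box],
                         pointing_vector: Optional[Tuple[float, float]]) -> int:
        """Toy clutter measure: number of candidate objects near the pointed-at location."""
        if pointing_vector is None:
            return len(candidates)
        px, py = pointing_vector
        near = [b for b in candidates
                if abs((b[0] + b[2]) / 2 - px) < 0.15 and abs((b[1] + b[3]) / 2 - py) < 0.15]
        return len(near)

    def ground_instruction(image, expression: str, candidates: List[Box],
                           pointing_vector: Optional[Tuple[float, float]],
                           light_pipeline: Callable, full_pipeline: Callable,
                           clutter_threshold: int = 1) -> Box:
        """Route the instruction to the cheapest pipeline expected to resolve it."""
        clutter = estimate_clutter(candidates, pointing_vector)
        if pointing_vector is not None and clutter <= clutter_threshold:
            # Uncluttered scene plus a gesture: a lightweight geometric resolver suffices.
            return light_pipeline(candidates, pointing_vector)
        # Cluttered scene or language-only instruction: run the full multi-modal REC model.
        return full_pipeline(image, expression, pointing_vector)

In practice, such a gate must itself be far cheaper than the pipelines it selects between; otherwise the routing overhead would erase the latency savings it is meant to deliver.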
format text
author WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
author_facet WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
author_sort WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon
title Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_short Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_full Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_fullStr Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_full_unstemmed Enabling and optimizing multi-modal sense-making for human-AI interaction tasks
title_sort enabling and optimizing multi-modal sense-making for human-ai interaction tasks
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/etd_coll/602
https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf
_version_ 1814047645506732032
spelling sg-smu-ink.etd_coll-1600 2024-07-17T08:11:42Z Enabling and optimizing multi-modal sense-making for human-AI interaction tasks WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha Weerakoon 2024-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/etd_coll/602 https://ink.library.smu.edu.sg/context/etd_coll/article/1600/viewcontent/GPIS_AY2019_PhD_DulangaWeerakoon.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Dissertations and Theses Collection (Open Access) eng Institutional Knowledge at Singapore Management University Human-AI Collaboration Referring Expression Comprehension Visual Grounding Spatio-Temporal Video Grounding Dynamic Model Optimizations Multi-Modal Processing Artificial Intelligence and Robotics