Poster: Towards efficient spatio-temporal video grounding in pervasive mobile devices

As the use of pervasive devices expands into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, they are equipped with a myriad of sensors facilitating natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifyin...

Full description

Saved in:
Bibliographic Details
Main Authors: WEERAKOON MUDIYANSELAGE, Dulanga Kaveesha, SUBBARAJU, Vigneshwaran, LIM, Joo Hwee, Misra, Archan
Format: text
Language:English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/9219
https://ink.library.smu.edu.sg/context/sis_research/article/10177/viewcontent/Mobisys2024_ProfilingEventVision_PosterPaper_cameraready.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Singapore Management University
Language: English
Description
Summary:As the use of pervasive devices expands into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, they are equipped with a myriad of sensors facilitating natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field-of-view referred to in a language instruction, is a key capability needed for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame. This results in runtime complexity that increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining a low-latency poses additional challenges. Hence, this paper explores the latency and energy requirements for implementing STVG models on a pervasive device.