Language-guided object segmentation

Bibliographic Details
Main Author: John Benedict, Remelia Shirlley
Other Authors: Chen Change Loy
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/175326
Institution: Nanyang Technological University
Description
Summary: Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring-Video Object Segmentation (R-VOS), which enables LVOS, existing methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets whose expressions describe predominantly static attributes, such as object colour and category names, or that focus on single objects identifiable in a single frame. This emphasis downplays the importance of tracking the target object's motion over time, so R-VOS models fail to capture fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes the temporal dynamics in videos, is used to overcome this challenge. Working with MeViS requires LVOS models to recognize temporal context and attend to the target object, a capability lacking in existing R-VOS methods. This report expands on the Language-guided Motion Perception and Matching (LMPM) model, a baseline model developed using the MeViS dataset, and seeks to improve the robustness of the LMPM model, specifically by addressing the challenges posed by uncertain user text input.