Language-guided object segmentation
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/175326
Institution: Nanyang Technological University
Summary: Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring Video Object Segmentation (R-VOS), which enables LVOS, these methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets featuring predominantly static attributes, such as object colour and category names, or focus on single objects identifiable in a single frame. This undermines the importance of tracking the target object's motion over time, causing R-VOS models to fail at capturing both fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes temporal dynamics in videos, is used to overcome this challenge. MeViS requires LVOS models to recognize temporal context and attend to the target object over time, a capability lacking in existing R-VOS methods. This report builds on the Language-guided Motion Perception and Matching (LMPM) model, a baseline developed on the MeViS dataset, and seeks to improve its robustness, specifically by addressing the challenges posed by uncertain user text input.
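To make the summary's central point concrete, the following toy sketch (not the LMPM implementation; all names and the expression "moving left" are illustrative assumptions) shows why a motion expression cannot be resolved from a single frame: each candidate object is represented only by its centroid x-positions over the video, and the target is identified by its net displacement across frames.

```python
# Illustrative sketch only -- NOT the LMPM model. It shows why resolving
# a motion expression requires temporal context: both candidates below
# occupy the same position in frame 0, so only their trajectories over
# time distinguish them.

def net_displacement(trajectory):
    """Net change in x-position from the first to the last frame."""
    return trajectory[-1] - trajectory[0]

def match_motion_expression(trajectories, direction="left"):
    """Pick the candidate whose net motion best fits the expression.

    `trajectories` maps a candidate name to its per-frame x-positions.
    A single frame cannot separate the candidates; the whole trajectory
    (the temporal context) is what identifies the target.
    """
    sign = -1 if direction == "left" else 1
    # Score each candidate by displacement in the described direction.
    scores = {name: sign * net_displacement(t)
              for name, t in trajectories.items()}
    return max(scores, key=scores.get)

# Two objects that coincide in frame 0 but move in opposite directions.
trajectories = {
    "object_a": [50, 40, 30, 20],  # drifts left over the video
    "object_b": [50, 55, 60, 70],  # drifts right over the video
}
print(match_motion_expression(trajectories, "left"))   # object_a
print(match_motion_expression(trajectories, "right"))  # object_b
```

A real LVOS model replaces the hand-written displacement score with learned motion perception and language matching, but the structure of the problem is the same: evidence must be aggregated across frames before a target can be selected.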