Language-guided object segmentation

Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring Video Object Segmentation (R-VOS), which enables LVOS, existing methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets featuring predominantly static attributes, such as object colour and category names, or focus on single objects identifiable in a single frame. This undermines the importance of tracking the target object's motion over time, causing R-VOS models to fail at capturing fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes the temporal dynamics in videos, is used to overcome this challenge. It requires LVOS models to recognize temporal context and attend to the target object, a capability lacking in existing R-VOS methods. This report expands on Language-guided Motion Perception and Matching (LMPM), a baseline model developed with the MeViS dataset, and seeks to improve its robustness, specifically by addressing the challenges posed by uncertain user text input.

Saved in:
Bibliographic Details
Main Author: John Benedict, Remelia Shirlley
Other Authors: Chen Change Loy
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Artificial intelligence
Online Access:https://hdl.handle.net/10356/175326
Institution: Nanyang Technological University
School: School of Computer Science and Engineering
Contact: ccloy@ntu.edu.sg
Degree: Bachelor's degree
Project Code: SCSE23-0379
Citation: John Benedict, R. S. (2024). Language-guided object segmentation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175326