Language-guided object segmentation
Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring Video Object Segmentation (R-VOS), which enables LVOS, these methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets dominated by static attributes, such as object colour and category names, or focus on single objects identifiable from a single frame. This overlooks the importance of tracking the target object's motion over time, causing R-VOS models to miss fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes the temporal dynamics in videos, is used to overcome this challenge. MeViS requires LVOS models to recognize temporal context and attend to the target object across frames, a capability lacking in existing R-VOS methods. This report expands on the Language-guided Motion Perception and Matching (LMPM) model, a baseline model developed on the MeViS dataset, and seeks to improve its robustness, specifically by addressing the challenges posed by uncertain user text input.
Saved in:

Main Author: | John Benedict, Remelia Shirlley |
---|---|
Other Authors: | Chen Change Loy |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/175326 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-175326
---|---
record_format | dspace
spelling | sg-ntu-dr.10356-175326 2024-04-26T15:44:49Z Language-guided object segmentation John Benedict, Remelia Shirlley Chen Change Loy School of Computer Science and Engineering ccloy@ntu.edu.sg Computer and Information Science Artificial intelligence Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring Video Object Segmentation (R-VOS), which enables LVOS, these methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets dominated by static attributes, such as object colour and category names, or focus on single objects identifiable from a single frame. This overlooks the importance of tracking the target object's motion over time, causing R-VOS models to miss fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes the temporal dynamics in videos, is used to overcome this challenge. MeViS requires LVOS models to recognize temporal context and attend to the target object across frames, a capability lacking in existing R-VOS methods. This report expands on the Language-guided Motion Perception and Matching (LMPM) model, a baseline model developed on the MeViS dataset, and seeks to improve its robustness, specifically by addressing the challenges posed by uncertain user text input. Bachelor's degree 2024-04-23T06:30:53Z 2024-04-23T06:30:53Z 2024 Final Year Project (FYP) John Benedict, R. S. (2024). Language-guided object segmentation. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175326 https://hdl.handle.net/10356/175326 en SCSE23-0379 application/pdf Nanyang Technological University
institution | Nanyang Technological University
building | NTU Library
continent | Asia
country | Singapore
content_provider | NTU Library
collection | DR-NTU
language | English
topic | Computer and Information Science; Artificial intelligence
spellingShingle | Computer and Information Science; Artificial intelligence; John Benedict, Remelia Shirlley; Language-guided object segmentation
description | Language-guided Video Object Segmentation (LVOS) is a multi-modal AI task that segments objects in videos based on natural language expressions. Although there has been significant research on Referring Video Object Segmentation (R-VOS), which enables LVOS, these methods still face limitations that prevent accurate LVOS performance in real-life scenarios. Current R-VOS methods often rely on datasets dominated by static attributes, such as object colour and category names, or focus on single objects identifiable from a single frame. This overlooks the importance of tracking the target object's motion over time, causing R-VOS models to miss fleeting movements and long-term actions. The Motion expressions Video Segmentation (MeViS) dataset, which prioritizes the temporal dynamics in videos, is used to overcome this challenge. MeViS requires LVOS models to recognize temporal context and attend to the target object across frames, a capability lacking in existing R-VOS methods. This report expands on the Language-guided Motion Perception and Matching (LMPM) model, a baseline model developed on the MeViS dataset, and seeks to improve its robustness, specifically by addressing the challenges posed by uncertain user text input.
author2 | Chen Change Loy
author_facet | Chen Change Loy; John Benedict, Remelia Shirlley
format | Final Year Project
author | John Benedict, Remelia Shirlley
author_sort | John Benedict, Remelia Shirlley
title | Language-guided object segmentation
title_short | Language-guided object segmentation
title_full | Language-guided object segmentation
title_fullStr | Language-guided object segmentation
title_full_unstemmed | Language-guided object segmentation
title_sort | language-guided object segmentation
publisher | Nanyang Technological University
publishDate | 2024
url | https://hdl.handle.net/10356/175326
_version_ | 1806059781595594752