Learning deep networks for video object segmentation
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/175018 |
Institution: | Nanyang Technological University |
Summary: | The Segment Anything Model (SAM) is an image segmentation model that has gained
significant traction due to its strong zero-shot transfer performance on unseen data
distributions and its applicability to downstream tasks. It supports a broad range of input methods, such as points, boxes, and automatic mask generation.
Traditional Video Object Segmentation (VOS) methods require strongly labelled training data
consisting of densely annotated pixel-level segmentation masks, which are both expensive and
time-consuming to obtain. We explore using only weakly labelled bounding box annotations,
which turns training into a weakly supervised process.
In this paper, we present BoxSAM, a novel method that combines the Segment Anything
Model (SAM) with a single-object tracker and monocular depth mapping to tackle the task of
Video Object Segmentation (VOS). BoxSAM leverages a robust bounding-box-based object
tracker and point augmentation techniques from attention maps to generate object masks,
which are then deconflicted using depth maps.
The proposed method achieves 81.8 on DAVIS 17 and 70.5 on YouTube-VOS 2018, which
compares favourably with other methods that were not trained on video segmentation data. |
---|
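The summary states that object masks are deconflicted using depth maps but does not say how. The sketch below illustrates one possible rule and should not be read as the BoxSAM implementation: it assumes each pixel claimed by several objects goes to the object whose typical depth is closest to that pixel's depth. The function name `deconflict_masks`, the representation of masks as sets of `(row, col)` pixels, and the median-depth rule are all hypothetical choices made for illustration.

```python
from statistics import median


def deconflict_masks(masks, depth):
    """Resolve overlapping masks with a depth map (illustrative assumption only).

    masks: dict mapping an integer object id to a set of (row, col) pixels.
    depth: 2D list of per-pixel depth values, indexed as depth[row][col].
    Returns a dict with the same keys in which every pixel has exactly one owner.
    """
    # Record which objects claim each pixel.
    claims = {}
    for obj_id, pixels in masks.items():
        for px in pixels:
            claims.setdefault(px, []).append(obj_id)

    # Reference depth per object: median over its uniquely claimed pixels,
    # falling back to all of its pixels if every one of them is contested.
    ref_depth = {}
    for obj_id, pixels in masks.items():
        unique = [px for px in pixels if len(claims[px]) == 1]
        sample = unique or list(pixels)
        if sample:
            ref_depth[obj_id] = median(depth[r][c] for r, c in sample)

    # Assign each pixel to the claimant with the closest reference depth.
    # Ties go to the smaller object id because min() keeps the first minimum.
    resolved = {obj_id: set() for obj_id in masks}
    for (r, c), owners in claims.items():
        winner = min(sorted(owners), key=lambda o: abs(depth[r][c] - ref_depth[o]))
        resolved[winner].add((r, c))
    return resolved


if __name__ == "__main__":
    # One-row toy frame: a near object on the left, a far object on the right.
    depth = [[1.0, 1.0, 5.0, 5.0]]
    masks = {1: {(0, 0), (0, 1), (0, 2)}, 2: {(0, 2), (0, 3)}}
    print(deconflict_masks(masks, depth))
```

In the toy frame, pixel `(0, 2)` is claimed by both objects. Its depth matches the far object, so it is assigned to object 2 and removed from object 1. A practical version would operate on image-sized arrays rather than Python sets.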