Enhancing performance in video grounding tasks through the use of captions

This report explores enhancing video grounding tasks by utilizing generated captions, addressing the challenge posed by sparse annotations in video datasets. We took inspiration from the PCNet model, which uses caption-guided attention to fuse feature maps with captions that are generated by Parallel Dynamic Video Captioning (PDVC) and selected via the Non-Prompt Caption Suppression (NPCS) algorithm, thereby providing prior knowledge for training. Our model is also inspired by the 2D-TAN model, which leverages a 2D temporal map to capture the temporal relations between moments. We built our modified model upon the 2D-TAN open-source codebase and evaluated it on several popular datasets. Although our approach does not surpass the reported accuracy of 2D-TAN and PCNet, it demonstrates improvements over some other benchmarks. This study underlines the potential of leveraging automatically generated captions to enrich video grounding models, as well as some limitations of the approach, paving the way for more effective multimedia content understanding.
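For context, the two ideas named in the abstract can be sketched in a few lines of Python/PyTorch. This is a minimal illustrative example, not code from the report or from the 2D-TAN or PCNet codebases; all tensor shapes, pooling choices, function names, and the residual fusion step are assumptions made for illustration.

    # Illustrative sketch only: a 2D-TAN-style temporal map plus a simple
    # caption-guided attention fusion. Not the report's actual implementation.
    import torch
    import torch.nn.functional as F

    def build_2d_temporal_map(clip_feats):
        # clip_feats: (N, D) features for N uniformly sampled clips.
        # Returns an (N, N, D) map whose entry (i, j), i <= j, represents candidate moment [i, j].
        n, d = clip_feats.shape
        tmap = clip_feats.new_zeros(n, n, d)
        for i in range(n):
            pooled = clip_feats[i].clone()
            tmap[i, i] = pooled
            for j in range(i + 1, n):
                # Running max-pool over clips i..j as a stand-in for moment-level features.
                pooled = torch.maximum(pooled, clip_feats[j])
                tmap[i, j] = pooled
        return tmap

    def caption_guided_attention(moment_feats, caption_feats):
        # moment_feats: (M, D) flattened valid moments; caption_feats: (C, D) generated caption embeddings.
        # Attends each moment over the captions and adds the weighted caption summary back (residual fusion).
        scores = moment_feats @ caption_feats.t() / moment_feats.shape[-1] ** 0.5  # (M, C)
        attn = F.softmax(scores, dim=-1)
        return moment_feats + attn @ caption_feats

    # Toy usage with random features; a real system would use visual and caption encoders.
    clips = torch.randn(16, 256)    # 16 clips, 256-d visual features
    captions = torch.randn(5, 256)  # 5 generated caption embeddings
    tmap = build_2d_temporal_map(clips)
    valid = tmap[torch.triu(torch.ones(16, 16)).bool()]  # keep moments with start <= end
    fused = caption_guided_attention(valid, captions)
    print(tmap.shape, fused.shape)  # torch.Size([16, 16, 256]) torch.Size([136, 256])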

Bibliographic Details
Main Author: Liu, Xinran
Other Authors: Sun Aixin
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science; Temporal sentence grounding; Machine learning
Online Access:https://hdl.handle.net/10356/175356
Record Details
Repository ID: sg-ntu-dr.10356-175356
Collection: DR-NTU (NTU Library, Nanyang Technological University)
School: School of Computer Science and Engineering
Supervisor contact: AXSun@ntu.edu.sg
Degree: Bachelor's degree
Type: Final Year Project (FYP)
Project code: SCSE23-0664
Format: application/pdf
Date deposited: 2024-04-22
Record last updated: 2024-04-26
Citation: Liu, X. (2024). Enhancing performance in video grounding tasks through the use of captions. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175356