Improving interpretable embeddings for ad-hoc video search with generative captions and multi-word concept bank

Aligning a user query with video clips in a cross-modal latent space, and aligning both with semantic concepts, are the two mainstream approaches to ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of 7 million generated text-video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modelling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, with an average of about 20%. The code and model are available at https://github.com/nikkiwoo-gh/Improved-ITV.
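The pre-training corpus described above is built from machine-generated captions paired with videos. As an illustration only, the Python sketch below shows how such pairs could be produced by applying an off-the-shelf image captioner to sampled keyframes; the model choice, file names, and sampling strategy are assumptions, not the authors' actual pipeline.

```python
# Illustrative only: captioning a sampled keyframe with an off-the-shelf BLIP model.
# The paper's actual generative model and caption-generation pipeline are not specified here.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_keyframe(image_path: str) -> str:
    """Generate one caption for a single keyframe image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Pairing each clip with captions of its sampled keyframes yields generated text-video pairs.
print(caption_keyframe("keyframe_0001.jpg"))  # hypothetical keyframe file
```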


Bibliographic Details
Main Authors: WU, Jiaxin, NGO, Chong-wah, CHAN, Wing-Kwong
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/9288
https://ink.library.smu.edu.sg/context/sis_research/article/10288/viewcontent/2404.06173v1.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10288
record_format dspace
spelling sg-smu-ink.sis_research-10288 2024-09-13T14:36:35Z
Improving interpretable embeddings for ad-hoc video search with generative captions and multi-word concept bank
WU, Jiaxin; NGO, Chong-wah; CHAN, Wing-Kwong
Aligning a user query with video clips in a cross-modal latent space, and aligning both with semantic concepts, are the two mainstream approaches to ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of 7 million generated text-video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modelling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, with an average of about 20%. The code and model are available at https://github.com/nikkiwoo-gh/Improved-ITV.
2024-06-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/9288
info:doi/10.1145/3652583.3658052
https://ink.library.smu.edu.sg/context/sis_research/article/10288/viewcontent/2404.06173v1.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Ad-hoc video search; Interpretable embedding; Large-scale video-text dataset; Concept bank construction; Out of vocabulary; Databases and Information Systems; Graphics and Human Computer Interfaces
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Ad-hoc video search
Interpretable embedding
Large-scale video-text dataset
Concept bank construction
Out of vocabulary
Databases and Information Systems
Graphics and Human Computer Interfaces
description Aligning a user query with video clips in a cross-modal latent space, and aligning both with semantic concepts, are the two mainstream approaches to ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of 7 million generated text-video pairs for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modelling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, with an average of about 20%. The code and model are available at https://github.com/nikkiwoo-gh/Improved-ITV.
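The multi-word concept bank is described as being built from syntax analysis of sentences. Below is a minimal illustrative sketch of that idea, assuming spaCy noun-chunk parsing is used to surface multi-word phrases; the authors' actual extraction and filtering rules are not specified here.

```python
# Illustrative only: extracting multi-word concept candidates by syntactic parsing.
# The concept-bank construction rules in the paper may differ from this toy filter.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def extract_multiword_concepts(text: str) -> list[str]:
    """Return multi-word noun-phrase candidates from a caption or query."""
    doc = nlp(text)
    concepts = []
    for chunk in doc.noun_chunks:
        # drop determiners/pronouns, keep phrases with at least two remaining words
        words = [t.text.lower() for t in chunk if t.pos_ not in ("DET", "PRON")]
        if len(words) > 1:
            concepts.append(" ".join(words))
    return concepts

print(extract_multiword_concepts("a man riding a brown horse on the beach"))
# e.g. ['brown horse'] under this toy rule
```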
format text
author WU, Jiaxin
NGO, Chong-wah
CHAN, Wing-Kwong
title Improving interpretable embeddings for ad-hoc video search with generative captions and multi-word concept bank
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9288
https://ink.library.smu.edu.sg/context/sis_research/article/10288/viewcontent/2404.06173v1.pdf