Interactive video search with multi-modal LLM video captioning

Cross-modal representation learning is essential for interactive text-to-video search. However, representation learning is limited by the size and quality of available video-caption pairs. To improve search accuracy, we propose to enlarge the pool of video-caption pairs by leveraging a multi-modal LLM for video captioning. Specifically, we use the LLM to generate captions for a large video collection (i.e., the WebVid dataset) and use the resulting video-caption pairs to pre-train a text-to-video search model. Additionally, we use the LLM to generate fine-grained captions for the test video collections to enable text-to-caption retrieval. Furthermore, our interactive video retrieval system builds a semantic overview of the retrieved rank list from these detailed captions, which serves as a set of hints for users to refine their queries. Experimental results show that the generated captions are effective in improving the search accuracy of both the Ad-hoc Video Search (AVS) and Textual Known-Item Search (T-KIS) tasks on the TRECVid datasets.

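The text-to-caption retrieval step described above can be sketched in a few lines. The Python below is a minimal illustration rather than the paper's implementation: the embedding model ("all-MiniLM-L6-v2" via the sentence-transformers library), the search helper, and the example captions are all assumptions introduced here; the paper instead pre-trains its own text-to-video search model on LLM-generated WebVid captions.

```python
# Minimal sketch of text-to-caption retrieval (illustrative assumptions,
# not the paper's implementation): LLM-generated captions stand in for
# videos, and a text query is matched against them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Hypothetical LLM-generated captions, one per video in the collection.
captions = {
    "video_001": "A man in a red jacket skis down a snowy slope.",
    "video_002": "Two chefs plate desserts in a restaurant kitchen.",
    "video_003": "Children play football on a rainy field.",
}

def search(query: str, top_k: int = 3):
    """Rank videos by query-to-caption cosine similarity."""
    ids = list(captions)
    cap_emb = model.encode([captions[i] for i in ids], convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, cap_emb)[0]  # similarity to each caption
    return sorted(zip(ids, scores.tolist()), key=lambda x: -x[1])[:top_k]

print(search("people cooking in a kitchen"))  # video_002 should rank first
```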

Bibliographic Details
Main Authors: CHENG, Yu-Tong, WU, Jiaxin, MA, Zhixin, HE, Jiangshan, WEI, Xiao-Yong, NGO, Chong-wah
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2025
Subjects: Interactive Video Retrieval; Multi-modal LLM; Video Captioning; Artificial Intelligence and Robotics; Databases and Information Systems
DOI: 10.1007/978-981-96-2074-6_36
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems
Online Access:https://ink.library.smu.edu.sg/sis_research/10105
https://ink.library.smu.edu.sg/context/sis_research/article/11105/viewcontent/InteractiveVideo_LLM_av.pdf
Institution: Singapore Management University