Efficient cross-modal video retrieval with meta-optimized frames

Cross-modal video retrieval aims to retrieve semantically relevant videos when given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features [1]-[3]. However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To break free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods, and can achieve new state-of-the-art performance. Our code is publicly available at: https://github.com/lionel-hing/MOF.
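The bilevel scheme the abstract describes (a base-level model update on a few learnable "compressed frames", then a meta-level update of those frames using the loss computed on the whole video, differentiated through the inner update) can be sketched in toy form. The sketch below is illustrative only and not the authors' implementation: the elementwise-linear "model" `w`, the squared-error matching loss, and all variable names are invented for the example, with the meta gradient derived by hand via the chain rule.

```python
# Illustrative toy sketch of the bilevel (MOF-style) idea, NOT the paper's
# code: an elementwise-linear "retrieval model" w is trained on K learnable
# compressed frames (base level), and those frames are updated with the
# gradient of the meta loss, computed on ALL T frames, propagated through
# the inner model update (meta level).

T, K, D = 16, 4, 2                 # full frames, compressed frames, feature dim
frames_full = [[1.0 + 0.1 * t, 2.0 - 0.05 * t] for t in range(T)]
text = [1.0, 1.0]                  # stand-in for a fixed text embedding

frames_comp = [row[:] for row in frames_full[:K]]   # meta-level parameters
w = [0.0] * D                                       # base-level model

def mean_frames(frames):
    # Toy "video encoder": mean-pool the frame features.
    return [sum(f[i] for f in frames) / len(frames) for i in range(D)]

def loss(w, frames):
    # Toy matching loss between the encoded video and the text embedding.
    m = mean_frames(frames)
    return sum((m[i] * w[i] - text[i]) ** 2 for i in range(D))

lr_b, lr_m = 0.1, 0.05
for _ in range(200):
    mc = mean_frames(frames_comp)
    # Base level: one inner gradient step of w on the compressed frames.
    w_new = [w[i] - lr_b * 2 * mc[i] * (mc[i] * w[i] - text[i]) for i in range(D)]
    # Meta level: chain rule through the inner step,
    # dL_meta/dF = (dL_meta/dw') * (dw'/dmc) * (dmc/dF), with dmc/dF = 1/K.
    mf = mean_frames(frames_full)
    dW = [2 * mf[i] * (mf[i] * w_new[i] - text[i]) for i in range(D)]
    dmc = [-lr_b * 2 * (2 * mc[i] * w[i] - text[i]) for i in range(D)]
    for t in range(K):
        for i in range(D):
            frames_comp[t][i] -= lr_m * dW[i] * dmc[i] / K
    w = w_new
```

The key design point the toy preserves is that the meta loss is evaluated on the full frame set while the model only ever consumes the K compressed frames, so the frames absorb whole-video information; in practice this inner-loop differentiation is what frameworks implement via higher-order autograd (e.g., `create_graph=True` in PyTorch).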

Bibliographic Details
Main Authors: HAN, Ning, YANG, Xun, LIM, Ee-peng, CHEN, Hao, SUN, Qianru
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/9034
https://ink.library.smu.edu.sg/context/sis_research/article/10037/viewcontent/2210.08452v1_sv.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10037
record_format dspace
spelling sg-smu-ink.sis_research-100372024-07-25T07:58:21Z Efficient cross-modal video retrieval with meta-optimized frames HAN, Ning YANG, Xun LIM, Ee-peng CHEN, Hao SUN, Qianru Cross-modal video retrieval aims to retrieve semantically relevant videos when given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features [1]-[3]. However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To break free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. 
Our results show that MOF is a generic and efficient method that boosts multiple baseline methods, and can achieve new state-of-the-art performance. Our code is publicly available at: https://github.com/lionel-hing/MOF. 2024-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9034 info:doi/10.1109/TMM.2024.3416669 https://ink.library.smu.edu.sg/context/sis_research/article/10037/viewcontent/2210.08452v1_sv.pdf http://creativecommons.org/licenses/by/3.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Cross-Modal Multimodal Video Compression Video Retrieval Databases and Information Systems Graphics and Human Computer Interfaces
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Cross-Modal
Multimodal
Video Compression
Video Retrieval
Databases and Information Systems
Graphics and Human Computer Interfaces
spellingShingle Cross-Modal
Multimodal
Video Compression
Video Retrieval
Databases and Information Systems
Graphics and Human Computer Interfaces
HAN, Ning
YANG, Xun
LIM, Ee-peng
CHEN, Hao
SUN, Qianru
Efficient cross-modal video retrieval with meta-optimized frames
description Cross-modal video retrieval aims to retrieve semantically relevant videos when given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features [1]-[3]. However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To break free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods, and can achieve new state-of-the-art performance. 
Our code is publicly available at: https://github.com/lionel-hing/MOF.
format text
author HAN, Ning
YANG, Xun
LIM, Ee-peng
CHEN, Hao
SUN, Qianru
author_facet HAN, Ning
YANG, Xun
LIM, Ee-peng
CHEN, Hao
SUN, Qianru
author_sort HAN, Ning
title Efficient cross-modal video retrieval with meta-optimized frames
title_short Efficient cross-modal video retrieval with meta-optimized frames
title_full Efficient cross-modal video retrieval with meta-optimized frames
title_fullStr Efficient cross-modal video retrieval with meta-optimized frames
title_full_unstemmed Efficient cross-modal video retrieval with meta-optimized frames
title_sort efficient cross-modal video retrieval with meta-optimized frames
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9034
https://ink.library.smu.edu.sg/context/sis_research/article/10037/viewcontent/2210.08452v1_sv.pdf
_version_ 1814047713467039744