Compositional prompting video-language models to understand procedure in instructional videos

Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing work on procedure understanding typically pretrains various video-language models with these pairs and then fine-tunes downstream classifiers and localizers...

Bibliographic Details
Main Authors:	Hu, Guyue, He, Bin, Zhang, Hanwang
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2023
Subjects:	Engineering::Computer science and engineering Prompt Learning Instructional Videos
Online Access:	https://hdl.handle.net/10356/168985
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-168985
record_format	dspace
spelling	sg-ntu-dr.10356-168985 2023-06-26T04:45:12Z Journal Article Hu, G., He, B. & Zhang, H. (2023). Compositional prompting video-language models to understand procedure in instructional videos. Machine Intelligence Research, 20(2), 249-262. https://dx.doi.org/10.1007/s11633-022-1409-1 2731-538X https://hdl.handle.net/10356/168985 10.1007/s11633-022-1409-1 2-s2.0-85149147475 en © Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2023.
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Computer science and engineering Prompt Learning Instructional Videos
description	Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing work on procedure understanding typically pretrains various video-language models with these pairs and then fine-tunes downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space faces the problem of combination disaster and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework that understands long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks as general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Moreover, the task reformulation enables CPL to perform well in zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Hu, Guyue He, Bin Zhang, Hanwang
format	Article
author	Hu, Guyue He, Bin Zhang, Hanwang
author_sort	Hu, Guyue
title	Compositional prompting video-language models to understand procedure in instructional videos
publishDate	2023
url	https://hdl.handle.net/10356/168985


Similar Items