Compositional prompting video-language models to understand procedure in instructional videos

Instructional videos are very useful for completing complex daily tasks, which naturally contain abundant clip-narration pairs. Existing works for procedure understanding are keen on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers...

Bibliographic Details
Main Authors: Hu, Guyue, He, Bin, Zhang, Hanwang
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering; Prompt Learning; Instructional Videos
Online Access: https://hdl.handle.net/10356/168985
Institution: Nanyang Technological University
id sg-ntu-dr.10356-168985
record_format dspace
spelling sg-ntu-dr.10356-168985
2023-06-26T04:45:12Z
Compositional prompting video-language models to understand procedure in instructional videos
Hu, Guyue
He, Bin
Zhang, Hanwang
School of Computer Science and Engineering
Engineering::Computer science and engineering
Prompt Learning
Instructional Videos
Instructional videos are very useful for completing complex daily tasks, which naturally contain abundant clip-narration pairs. Existing works for procedure understanding are keen on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, the predetermined procedure category faces the problem of combination disaster and is inherently inapt to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (including the action prompt, object prompt, and procedure prompt), which could compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in all zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
2023-06-26T04:45:12Z
2023-06-26T04:45:12Z
2023
Journal Article
Hu, G., He, B. & Zhang, H. (2023). Compositional prompting video-language models to understand procedure in instructional videos. Machine Intelligence Research, 20(2), 249-262. https://dx.doi.org/10.1007/s11633-022-1409-1
2731-538X
https://hdl.handle.net/10356/168985
10.1007/s11633-022-1409-1
2-s2.0-85149147475
2
20
249
262
en
Machine Intelligence Research
© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2023.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Prompt Learning
Instructional Videos
spellingShingle Engineering::Computer science and engineering
Prompt Learning
Instructional Videos
Hu, Guyue
He, Bin
Zhang, Hanwang
Compositional prompting video-language models to understand procedure in instructional videos
description Instructional videos are very useful for completing complex daily tasks, which naturally contain abundant clip-narration pairs. Existing works for procedure understanding are keen on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, the predetermined procedure category faces the problem of combination disaster and is inherently inapt to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (including the action prompt, object prompt, and procedure prompt), which could compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in all zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
author2 School of Computer Science and Engineering
author_facet School of Computer Science and Engineering
Hu, Guyue
He, Bin
Zhang, Hanwang
format Article
author Hu, Guyue
He, Bin
Zhang, Hanwang
author_sort Hu, Guyue
title Compositional prompting video-language models to understand procedure in instructional videos
title_short Compositional prompting video-language models to understand procedure in instructional videos
title_full Compositional prompting video-language models to understand procedure in instructional videos
title_fullStr Compositional prompting video-language models to understand procedure in instructional videos
title_full_unstemmed Compositional prompting video-language models to understand procedure in instructional videos
title_sort compositional prompting video-language models to understand procedure in instructional videos
publishDate 2023
url https://hdl.handle.net/10356/168985
_version_ 1772827443485212672
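
Illustrative note: the description above says that procedure understanding is reformulated as video-text matching, with textual prompts composed from actions, objects, and the procedure. The following is a minimal sketch of that matching idea only, not the authors' CPL implementation. The prompt template, the function names, and the bag-of-words "encoder" are all hypothetical stand-ins; a real system would use a pretrained video-language model and learned prompts.

```python
import math
from collections import Counter


def embed_text(text):
    """Toy stand-in for a text encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compose_prompt(procedure, steps):
    """Compose one textual prompt from a procedure name and (action, object) steps.

    The template is hypothetical; it only illustrates composing the three parts.
    """
    step_text = " then ".join(f"{action} {obj}" for action, obj in steps)
    return f"a video of how to {procedure} : {step_text}"


def match_procedure(video_embedding, candidates):
    """Score every candidate procedure prompt against the video and pick the best."""
    scores = {
        name: cosine(video_embedding, embed_text(compose_prompt(name, steps)))
        for name, steps in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical candidates; no category-specific classifier is trained.
candidates = {
    "make tea": [("boil", "water"), ("add", "tea")],
    "change tire": [("loosen", "nuts"), ("remove", "tire")],
}

# Stand-in for a visual encoder: the clip narrations embedded in the same toy space.
video_embedding = embed_text("boil water add tea pour water")

best, scores = match_procedure(video_embedding, candidates)
print(best, scores)
```

Because candidates are plain text, a new procedure can be scored by adding an entry to `candidates` without retraining, which is the property the description associates with zero-shot use.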