Contrastive video question answering via video graph transformer

We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT’s uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their rel...

Bibliographic Details
Main Authors:	XIAO, Junbin; ZHOU, Pan; YAO, Angela; LI, Yicong; HONG, Richang; YAN, Shuicheng; CHUA, Tat-Seng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	VideoQA; Cross-Modal Visual Reasoning; Video-Language; Dynamic Visual Graphs; Contrastive Learning; Transformer; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/9053 https://ink.library.smu.edu.sg/context/sis_research/article/10056/viewcontent/2023_TPAMI_ContrastiveVideo.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-10056
record_format	dspace
spelling	sg-smu-ink.sis_research-10056 2024-08-01T15:38:43Z 2023-07-01T07:00:00Z text application/pdf info:doi/10.1109/TPAMI.2023.3292266 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	VideoQA; Cross-Modal Visual Reasoning; Video-Language; Dynamic Visual Graphs; Contrastive Learning; Transformer; Graphics and Human Computer Interfaces
description	We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT’s uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, and between the relevant and irrelevant questions, respectively. With superior video encoding and QA solution, we show that CoVGT achieves much better performance than prior art on video reasoning tasks. Its performance even surpasses that of models pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code is available at https://github.com/doc-doc/CoVGT.
format	text
author	XIAO, Junbin; ZHOU, Pan; YAO, Angela; LI, Yicong; HONG, Richang; YAN, Shuicheng; CHUA, Tat-Seng
author_facet	XIAO, Junbin; ZHOU, Pan; YAO, Angela; LI, Yicong; HONG, Richang; YAN, Shuicheng; CHUA, Tat-Seng
author_sort	XIAO, Junbin
title	Contrastive video question answering via video graph transformer
title_sort	contrastive video question answering via video graph transformer
publisher	Institutional Knowledge at Singapore Management University
publishDate	2023
url	https://ink.library.smu.edu.sg/sis_research/9053 https://ink.library.smu.edu.sg/context/sis_research/article/10056/viewcontent/2023_TPAMI_ContrastiveVideo.pdf
_version_	1814047718509641728


Similar Items