Action-centric relation transformer network for video question answering
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Substantial effort has been devoted to developing more effective fusion strategies and better intra-modal feature preparation. Exploring these issues further, we identify two key problems. (1) Current works rarely incorporate the action of interest into the video representation, and many datasets provide insufficient labels for where that action occurs, even though VideoQA questions are usually action-centric. (2) Frame-to-frame relations, which can capture useful temporal attributes (e.g., state transitions, action counting), remain under-explored. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA with two significant improvements. (1) We explicitly address the action recognition problem with a visual feature encoding technique, action-based encoding (ABE), which emphasizes frames with high actionness probabilities (the probability that a frame contains an action). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code can be found at https://github.com/op-multimodal/ACRTransformer.
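As a rough illustration of the two components the abstract names, here is a minimal PyTorch sketch: an actionness-weighted frame encoding standing in for action-based encoding (ABE), and a plain transformer encoder standing in for the relation transformer (RTransformer). The class names, the linear actionness head, and the layer sizes are all assumptions made for illustration; the authors' actual implementation is in the linked GitHub repository.

```python
import torch
import torch.nn as nn


class ActionBasedEncoding(nn.Module):
    """Hypothetical sketch of action-based encoding (ABE): reweight frame
    features by a per-frame actionness probability so that frames likely
    to contain actions dominate the video representation. The paper's
    actual actionness scorer (e.g., a pretrained temporal action
    detector) may differ from this simple linear head."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.actionness = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        p = self.actionness(frames)   # (batch, num_frames, 1), in [0, 1]
        return frames * p             # emphasize action-rich frames


class RTransformerSketch(nn.Module):
    """Stand-in for the relation transformer (RTransformer): standard
    self-attention layers model frame-to-frame relations such as state
    transitions; the published model's exact design may differ."""

    def __init__(self, feat_dim: int, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.encoder(frames)   # relation-aware frame features


# Toy usage: 16 frames of 512-d features for a single video.
video = torch.randn(1, 16, 512)
encoded = RTransformerSketch(512)(ActionBasedEncoding(512)(video))
print(encoded.shape)  # torch.Size([1, 16, 512])
```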
Main Authors: ZHANG, Jipeng; SHAO, Jie; CAO, Rui; GAO, Lianli; XU, Xing; SHEN, Heng Tao
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2022
DOI: 10.1109/TCSVT.2020.3048440
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Subjects: Knowledge discovery; Multi-modal reasoning; Proposals; Relation reasoning; Task analysis; Temporal action detection; Video question answering; Video representation; Visualization; Cognition; Encoding; Feature extraction; Broadcast and Video Studies; Databases and Information Systems; Numerical Analysis and Scientific Computing
Online Access: https://ink.library.smu.edu.sg/sis_research/6020
https://ink.library.smu.edu.sg/context/sis_research/article/7023/viewcontent/Action_Centric_Relation_Video_Question_Answering_av.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems