Question-guided hybrid convolution for visual question answering

In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features to capture the textual and visual relationship in the early stage. The question-guided convolution tightly couples the textual and visual information but also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method could further boost the performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
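
Based only on the description above, the core idea can be sketched as a group convolution in which some kernel groups are predicted from the question embedding while the rest are ordinary learned kernels. The following PyTorch sketch is a hypothetical illustration; the class name, layer sizes, and the linear kernel-prediction head are assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QGHCBlockSketch(nn.Module):
    """Hypothetical question-guided hybrid group convolution (illustrative only)."""
    def __init__(self, in_ch=256, out_ch=256, q_dim=1024, groups=8, dep_groups=4, k=3):
        super().__init__()
        assert in_ch % groups == 0 and out_ch % groups == 0
        self.groups, self.dep_groups, self.k = groups, dep_groups, k
        self.in_g, self.out_g = in_ch // groups, out_ch // groups
        # Question-independent kernels: ordinary learned parameters shared by all questions.
        self.indep_weight = nn.Parameter(
            0.01 * torch.randn((groups - dep_groups) * self.out_g, self.in_g, k, k))
        # Question-dependent kernels: predicted from the question embedding, so only a
        # fraction of the kernels have to be generated, keeping the predicted parameters small.
        self.kernel_pred = nn.Linear(q_dim, dep_groups * self.out_g * self.in_g * k * k)

    def forward(self, v, q):
        # v: visual feature map (B, in_ch, H, W); q: question embedding (B, q_dim)
        b = v.size(0)
        dep_w = self.kernel_pred(q).view(b * self.dep_groups * self.out_g,
                                         self.in_g, self.k, self.k)
        n_indep_ch = (self.groups - self.dep_groups) * self.in_g
        outs = []
        for i in range(b):  # per-sample convolution: kernels differ for each question
            vi = v[i:i + 1]
            # Question-independent groups use the shared learned kernels.
            indep_out = F.conv2d(vi[:, :n_indep_ch], self.indep_weight,
                                 padding=self.k // 2,
                                 groups=self.groups - self.dep_groups)
            # Question-dependent groups use kernels predicted from this sample's question.
            wi = dep_w[i * self.dep_groups * self.out_g:(i + 1) * self.dep_groups * self.out_g]
            dep_out = F.conv2d(vi[:, n_indep_ch:], wi,
                               padding=self.k // 2, groups=self.dep_groups)
            outs.append(torch.cat([indep_out, dep_out], dim=1))
        return torch.cat(outs, dim=0)

Predicting only a subset of the groups from the question, rather than a full convolution's worth of kernels, is one way to realize the parameter-size reduction that the abstract attributes to the hybrid group design.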

Bibliographic Details
Main Authors: GAO, Peng, LU, Pan, LI, Hongsheng, LI, Shuang, LI, Yikang, HOI, Steven C. H., WANG, Xiaogang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2018
Subjects:
VQA
Online Access:https://ink.library.smu.edu.sg/sis_research/4182
https://ink.library.smu.edu.sg/context/sis_research/article/5185/viewcontent/Question_GuidedHybridConvolution_2018_afv.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-5185
record_format dspace
spelling sg-smu-ink.sis_research-51852020-03-26T07:43:21Z Question-guided hybrid convolution for visual question answering GAO, Peng LU, Pan LI, Hongsheng LI, Shuang LI, Yikang HOI, Steven C. H. WANG, Xiaogang In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features to capture the textual and visual relationship in the early stage. The question-guided convolution tightly couples the textual and visual information but also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method could further boost the performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC. 2018-09-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4182 info:doi/10.1007/978-3-030-01246-5_29 https://ink.library.smu.edu.sg/context/sis_research/article/5185/viewcontent/Question_GuidedHybridConvolution_2018_afv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University VQA Dynamic Parameter Prediction Group Convolution Databases and Information Systems Theory and Algorithms
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic VQA
Dynamic Parameter Prediction
Group Convolution
Databases and Information Systems
Theory and Algorithms
spellingShingle VQA
Dynamic Parameter Prediction
Group Convolution
Databases and Information Systems
Theory and Algorithms
GAO, Peng
LU, Pan
LI, Hongsheng
LI, Shuang
LI, Yikang
HOI, Steven C. H.
WANG, Xiaogang
Question-guided hybrid convolution for visual question answering
description In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features to capture the textual and visual relationship in the early stage. The question-guided convolution tightly couples the textual and visual information but also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method could further boost the performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
format text
author GAO, Peng
LU, Pan
LI, Hongsheng
LI, Shuang
LI, Yikang
HOI, Steven C. H.
WANG, Xiaogang
author_facet GAO, Peng
LU, Pan
LI, Hongsheng
LI, Shuang
LI, Yikang
HOI, Steven C. H.
WANG, Xiaogang
author_sort GAO, Peng
title Question-guided hybrid convolution for visual question answering
title_short Question-guided hybrid convolution for visual question answering
title_full Question-guided hybrid convolution for visual question answering
title_fullStr Question-guided hybrid convolution for visual question answering
title_full_unstemmed Question-guided hybrid convolution for visual question answering
title_sort question-guided hybrid convolution for visual question answering
publisher Institutional Knowledge at Singapore Management University
publishDate 2018
url https://ink.library.smu.edu.sg/sis_research/4182
https://ink.library.smu.edu.sg/context/sis_research/article/5185/viewcontent/Question_GuidedHybridConvolution_2018_afv.pdf
_version_ 1770574422970728448