Visual commonsense representation learning via causal inference

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. The methods are fundamentally different, however: VC R-CNN makes its prediction with the causal intervention P(Y|do(X)), while the others use the conventional likelihood P(Y|X). We extensively apply VC R-CNN features in prevailing models for two popular tasks, Image Captioning and VQA, and observe consistent performance boosts across all the methods, achieving many new state-of-the-art results. Code and features are available at https://github.com/Wangt-CN/VC-R-CNN. For better clarity, you can also refer to the full version of this paper in [11].
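
The abstract's key distinction, predicting a region's context with the intervention P(Y|do(X)) rather than the likelihood P(Y|X), is typically realized via the backdoor adjustment, P(Y|do(X)) = sum_z P(Y|X,z) P(z), which marginalizes over a dictionary of confounders z (object classes) weighted by their prior P(z). Below is a minimal PyTorch sketch of that adjustment for one region feature; the dimensions, the linear classifiers, the randomly initialized confounder dictionary, and the uniform prior are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the paper): d-dim region
# features, K confounder classes, C context-object classes.
d, K, C = 1024, 80, 80

Z = torch.randn(K, d)            # hypothetical confounder dictionary: one feature per object class
p_z = torch.full((K,), 1.0 / K)  # prior P(z); uniform here purely for illustration
W_lik = torch.nn.Linear(d, C)      # classifier for the conventional likelihood P(Y|X)
W_int = torch.nn.Linear(2 * d, C)  # classifier over a region paired with a confounder

def p_y_given_x(x):
    # Conventional likelihood P(Y|X): condition on the region feature alone.
    return F.softmax(W_lik(x), dim=-1)

def p_y_do_x(x):
    # Backdoor adjustment P(Y|do(X)) = sum_z P(Y|X, z) P(z):
    # score the region paired with every confounder z, then average
    # the resulting class distributions under the prior P(z).
    xz = torch.cat([x.expand(K, d), Z], dim=-1)   # (K, 2d): pair x with each z
    p_y_given_xz = F.softmax(W_int(xz), dim=-1)   # P(Y|X, z), one row per z
    return p_z @ p_y_given_xz                     # marginalize out z -> (C,)

x = torch.randn(d)                              # feature of one detected region
print(p_y_given_x(x).shape, p_y_do_x(x).shape)  # both (C,): likelihood vs. intervention

In the paper itself, this sum is approximated with a Normalized Weighted Geometric Mean (NWGM), moving the expectation over z inside the softmax so a single forward pass suffices; the explicit sum over the dictionary above computes the exact adjusted distribution instead.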


Bibliographic Details
Main Authors: WANG, Tan; HUANG, Jianqiang; ZHANG, Hanwang; SUN, Qianru
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/5598
https://ink.library.smu.edu.sg/context/sis_research/article/6601/viewcontent/Wang_Visual_Commonsense_Representation_Learning_via_Causal_Inference_CVPRW_2020_paper.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-6601
record_format dspace
spelling sg-smu-ink.sis_research-6601
2021-01-07T13:56:17Z
Visual commonsense representation learning via causal inference
WANG, Tan
HUANG, Jianqiang
ZHANG, Hanwang
SUN, Qianru
We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. The methods are fundamentally different, however: VC R-CNN makes its prediction with the causal intervention P(Y|do(X)), while the others use the conventional likelihood P(Y|X). We extensively apply VC R-CNN features in prevailing models for two popular tasks, Image Captioning and VQA, and observe consistent performance boosts across all the methods, achieving many new state-of-the-art results. Code and features are available at https://github.com/Wangt-CN/VC-R-CNN. For better clarity, you can also refer to the full version of this paper in [11].
2020-06-01T07:00:00Z
text
application/pdf
https://ink.library.smu.edu.sg/sis_research/5598
info:doi/10.1109/CVPRW50498.2020.00197
https://ink.library.smu.edu.sg/context/sis_research/article/6601/viewcontent/Wang_Visual_Commonsense_Representation_Learning_via_Causal_Inference_CVPRW_2020_paper.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Databases and Information Systems
Graphics and Human Computer Interfaces
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
Graphics and Human Computer Interfaces
spellingShingle Databases and Information Systems
Graphics and Human Computer Interfaces
WANG, Tan
HUANG, Jianqiang
ZHANG, Hanwang
SUN, Qianru
Visual commonsense representation learning via causal inference
description We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. The methods are fundamentally different, however: VC R-CNN makes its prediction with the causal intervention P(Y|do(X)), while the others use the conventional likelihood P(Y|X). We extensively apply VC R-CNN features in prevailing models for two popular tasks, Image Captioning and VQA, and observe consistent performance boosts across all the methods, achieving many new state-of-the-art results. Code and features are available at https://github.com/Wangt-CN/VC-R-CNN. For better clarity, you can also refer to the full version of this paper in [11].
format text
author WANG, Tan
HUANG, Jianqiang
ZHANG, Hanwang
SUN, Qianru
author_facet WANG, Tan
HUANG, Jianqiang
ZHANG, Hanwang
SUN, Qianru
author_sort WANG, Tan
title Visual commonsense representation learning via causal inference
title_short Visual commonsense representation learning via causal inference
title_full Visual commonsense representation learning via causal inference
title_fullStr Visual commonsense representation learning via causal inference
title_full_unstemmed Visual commonsense representation learning via causal inference
title_sort visual commonsense representation learning via causal inference
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/sis_research/5598
https://ink.library.smu.edu.sg/context/sis_research/article/6601/viewcontent/Wang_Visual_Commonsense_Representation_Learning_via_Causal_Inference_CVPRW_2020_paper.pdf
_version_ 1770575523983917056