Visual Commonsense R-CNN

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., us...

Bibliographic Details
Main Authors:	WANG, Tan; HUANG, Jianqiang; ZHANG, Hanwang; SUN, Qianru
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2020
Subjects:	Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
Online Access:	https://ink.library.smu.edu.sg/sis_research/5592 https://ink.library.smu.edu.sg/context/sis_research/article/6595/viewcontent/CVPR2020_VC_R_CNN.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-6595
record_format	dspace
spelling	sg-smu-ink.sis_research-6595 2021-01-07T14:00:46Z Visual Commonsense R-CNN WANG, Tan HUANG, Jianqiang ZHANG, Hanwang SUN, Qianru We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), like any other unsupervised feature learning method (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: VC R-CNN makes its prediction using causal intervention, P(Y|do(X)), while the others use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, rather than just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across them, achieving many new state-of-the-art results. 2020-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/5592 info:doi/10.1109/CVPR42600.2020.01077 https://ink.library.smu.edu.sg/context/sis_research/article/6595/viewcontent/CVPR2020_VC_R_CNN.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Artificial Intelligence and Robotics Graphics and Human Computer Interfaces
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Artificial Intelligence and Robotics; Graphics and Human Computer Interfaces
spellingShingle	Artificial Intelligence and Robotics Graphics and Human Computer Interfaces WANG, Tan HUANG, Jianqiang ZHANG, Hanwang SUN, Qianru Visual Commonsense R-CNN
description	We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), like any other unsupervised feature learning method (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: VC R-CNN makes its prediction using causal intervention, P(Y|do(X)), while the others use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, rather than just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across them, achieving many new state-of-the-art results.
format	text
author	WANG, Tan; HUANG, Jianqiang; ZHANG, Hanwang; SUN, Qianru
author_facet	WANG, Tan; HUANG, Jianqiang; ZHANG, Hanwang; SUN, Qianru
author_sort	WANG, Tan
title	Visual Commonsense R-CNN
title_short	Visual Commonsense R-CNN
title_full	Visual Commonsense R-CNN
title_fullStr	Visual Commonsense R-CNN
title_full_unstemmed	Visual Commonsense R-CNN
title_sort	visual commonsense r-cnn
publisher	Institutional Knowledge at Singapore Management University
publishDate	2020
url	https://ink.library.smu.edu.sg/sis_research/5592 https://ink.library.smu.edu.sg/context/sis_research/article/6595/viewcontent/CVPR2020_VC_R_CNN.pdf
_version_	1770575520881180672
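
Note on the description field: the abstract contrasts the conventional likelihood P(Y|X) with the interventional quantity P(Y|do(X)) but does not say how the latter is computed. The sketch below is only an illustration of why the two can differ. It assumes a single observed, discrete confounder Z and uses the standard backdoor adjustment, P(Y|do(X)) = sum over z of P(Y|X,z) P(z). The variable names and all probabilities are invented for this example and are not taken from the paper.

```python
# Toy comparison of P(Y=1 | X=1) and P(Y=1 | do(X=1)).
# Assumption (not from the record): one observed binary confounder Z
# that influences both X and Y, so the backdoor adjustment applies.
# All numbers below are made up for illustration.

p_z = {0: 0.5, 1: 0.5}               # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}      # P(X=1 | Z=z)
p_y1_given_x1_z = {0: 0.3, 1: 0.9}   # P(Y=1 | X=1, Z=z)


def conditional(p_z, p_x_given_z, p_y_given_xz):
    """P(Y=1 | X=1): weights each z by the posterior P(z | X=1)."""
    p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)
    return sum(
        p_y_given_xz[z] * (p_x_given_z[z] * p_z[z] / p_x) for z in p_z
    )


def interventional(p_z, p_y_given_xz):
    """P(Y=1 | do(X=1)): weights each z by the prior P(z)."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)


likelihood = conditional(p_z, p_x1_given_z, p_y1_given_x1_z)
intervention = interventional(p_z, p_y1_given_x1_z)
print(f"P(Y=1 | X=1)     = {likelihood:.2f}")
print(f"P(Y=1 | do(X=1)) = {intervention:.2f}")
```

With these invented numbers the conditional probability is larger than the interventional one, because observing X=1 makes Z=1 more likely and Z=1 itself raises Y. Setting X by intervention leaves the distribution of Z unchanged, which removes that co-occurrence effect.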


Similar Items