Visual Commonsense R-CNN

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., us...

Bibliographic Details
Main Authors: WANG, Tan, HUANG, Jianqiang, ZHANG, Hanwang, SUN, Qianru
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects: Artificial Intelligence and Robotics, Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/5592
https://ink.library.smu.edu.sg/context/sis_research/article/6595/viewcontent/CVPR2020_VC_R_CNN.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-6595
record_format dspace
spelling sg-smu-ink.sis_research-6595
2021-01-07T14:00:46Z
2020-06-01T07:00:00Z
text
application/pdf
info:doi/10.1109/CVPR42600.2020.01077
http://creativecommons.org/licenses/by-nc-nd/4.0/
Research Collection School Of Computing and Information Systems
eng
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Artificial Intelligence and Robotics
Graphics and Human Computer Interfaces
description We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), the proxy training objective of VC R-CNN, like that of other unsupervised feature learning methods (e.g., word2vec), is to predict the contextual objects of a region. However, the two are fundamentally different: VC R-CNN makes its prediction using causal intervention, P(Y|do(X)), while other methods use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, rather than just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across them, achieving many new state-of-the-art results.
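The difference between P(Y|X) and P(Y|do(X)) mentioned in the description can be illustrated with a small numerical sketch. This is not the authors' implementation: the record does not say how the intervention is computed, so the example assumes a single discrete confounder Z and uses the standard backdoor adjustment, P(Y|do(X)) = sum over z of P(Y|X,z)P(z). All variable names and probabilities below are hypothetical, chosen so that X (say, a table region) has no causal effect on Y (a chair being present) and any association comes only from Z (say, the scene type).

```python
# Toy model, all numbers hypothetical:
#   Z: confounder (e.g., 1 = dining scene), X: region observed, Y: context object present.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z = z)
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}             # P(X = 1 | Z = z)
# P(Y = 1 | X = x, Z = z): depends on Z only, so X has no causal effect on Y.
P_Y1_GIVEN_XZ = {(0, 0): 0.2, (1, 0): 0.2, (0, 1): 0.8, (1, 1): 0.8}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary x."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def likelihood(x):
    """Conventional likelihood P(Y = 1 | X = x): weights z by P(z | x)."""
    evidence = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    joint = sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return joint / evidence


def intervention(x):
    """P(Y = 1 | do(X = x)) by backdoor adjustment: weights z by the prior P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(likelihood(1), 2))
    print("P(Y=1 | do(X=1)) =", round(intervention(1), 2))
```

In this toy setup the conditional likelihood is pulled upward by the co-occurrence of X and Y through Z, while the interventional quantity is the same for X=0 and X=1, reflecting that X itself does not cause Y.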
format text
author WANG, Tan
HUANG, Jianqiang
ZHANG, Hanwang
SUN, Qianru
title Visual Commonsense R-CNN
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/sis_research/5592
https://ink.library.smu.edu.sg/context/sis_research/article/6595/viewcontent/CVPR2020_VC_R_CNN.pdf
_version_ 1770575520881180672