Stacked attention networks for referring expressions comprehension

Referring expressions comprehension is the task of locating the image region described by a natural language expression, which refers to the properties of the region or its relationships with other regions. Most previous work handles this problem by selecting the most relevant regions from a set of c...


Bibliographic Details
Main Authors: Li, Yugang, Sun, Haibo, Chen, Zhe, Ding, Yudan, Zhou, Siqi
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2021
Subjects:
Online Access:https://hdl.handle.net/10356/146884
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-146884
record_format dspace
spelling sg-ntu-dr.10356-1468842021-03-12T06:35:32Z Stacked attention networks for referring expressions comprehension Li, Yugang Sun, Haibo Chen, Zhe Ding, Yudan Zhou, Siqi School of Electrical and Electronic Engineering Engineering::Computer science and engineering Stacked Attention Networks Referring Expression Referring expressions comprehension is the task of locating the image region described by a natural language expression, which refers to the properties of the region or its relationships with other regions. Most previous work handles this problem by selecting the most relevant region from a set of candidate regions; when there are many candidate regions in the set, these methods are inefficient. Inspired by the recent success of image captioning with deep learning methods, in this paper we propose a framework that understands referring expressions through multiple steps of reasoning. We present a model for referring expressions comprehension that selects the most relevant region directly from the image. The core of our model is a recurrent attention network, which can be seen as an extension of the Memory Network. The proposed model is capable of improving the results through multiple computational hops. We evaluate the proposed model on two referring expression datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate that the proposed model outperforms previous state-of-the-art methods in both accuracy and efficiency. We also conduct an ablation experiment to show that the performance of the model does not keep improving as the number of attention layers increases. Published version 2021-03-12T06:35:32Z 2021-03-12T06:35:32Z 2020 Journal Article Li, Y., Sun, H., Chen, Z., Ding, Y. & Zhou, S. (2020). Stacked attention networks for referring expressions comprehension. Computers, Materials and Continua, 65(3), 2529-2541. 
https://dx.doi.org/10.32604/cmc.2020.011886 1546-2218 https://hdl.handle.net/10356/146884 10.32604/cmc.2020.011886 2-s2.0-85091886113 3 65 2529 2541 en Computers, Materials and Continua © 2020 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Stacked Attention Networks
Referring Expression
spellingShingle Engineering::Computer science and engineering
Stacked Attention Networks
Referring Expression
Li, Yugang
Sun, Haibo
Chen, Zhe
Ding, Yudan
Zhou, Siqi
Stacked attention networks for referring expressions comprehension
description Referring expressions comprehension is the task of locating the image region described by a natural language expression, which refers to the properties of the region or its relationships with other regions. Most previous work handles this problem by selecting the most relevant region from a set of candidate regions; when there are many candidate regions in the set, these methods are inefficient. Inspired by the recent success of image captioning with deep learning methods, in this paper we propose a framework that understands referring expressions through multiple steps of reasoning. We present a model for referring expressions comprehension that selects the most relevant region directly from the image. The core of our model is a recurrent attention network, which can be seen as an extension of the Memory Network. The proposed model is capable of improving the results through multiple computational hops. We evaluate the proposed model on two referring expression datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate that the proposed model outperforms previous state-of-the-art methods in both accuracy and efficiency. We also conduct an ablation experiment to show that the performance of the model does not keep improving as the number of attention layers increases.
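The multi-hop attention described in the abstract (a query refined over several "computational hops" of attention on region features, in the memory-network style) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the additive query update, and the use of a plain dot-product score are all illustrative choices.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(query, regions, num_hops=2):
    """Multi-hop attention over image regions (illustrative sketch).

    query   : (d,)   embedding of the referring expression
    regions : (n, d) feature vectors of n image regions
    Returns a probability distribution over the n regions.
    """
    q = query
    for _ in range(num_hops):
        scores = regions @ q        # (n,) relevance of each region to q
        alpha = softmax(scores)     # attention weights over regions
        context = alpha @ regions   # (d,) attended visual summary
        q = q + context             # refine the query, memory-network style
    return softmax(regions @ q)     # final distribution over regions
```

In this sketch, each hop sharpens the query with visual context before the final region scores are computed; the paper's ablation finding (accuracy does not keep improving with more layers) corresponds to varying `num_hops`.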
author2 School of Electrical and Electronic Engineering
author_facet School of Electrical and Electronic Engineering
Li, Yugang
Sun, Haibo
Chen, Zhe
Ding, Yudan
Zhou, Siqi
format Article
author Li, Yugang
Sun, Haibo
Chen, Zhe
Ding, Yudan
Zhou, Siqi
author_sort Li, Yugang
title Stacked attention networks for referring expressions comprehension
title_short Stacked attention networks for referring expressions comprehension
title_full Stacked attention networks for referring expressions comprehension
title_fullStr Stacked attention networks for referring expressions comprehension
title_full_unstemmed Stacked attention networks for referring expressions comprehension
title_sort stacked attention networks for referring expressions comprehension
publishDate 2021
url https://hdl.handle.net/10356/146884
_version_ 1695706184789524480