Learning to compose and reason with language tree structures for visual grounding

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained language compositions. However, existing solutions merely rely on the association between the...

Bibliographic Details
Main Authors: Hong, Richang, Liu, Daqing, Mo, Xiaoyu, He, Xiangnan, Zhang, Hanwang
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2022
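Subjects: Engineering::Electrical and electronic engineering; Engineering::Computer science and engineering; Fine-Grained Detection; Tree Structure
Online Access: https://hdl.handle.net/10356/162632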
Institution: Nanyang Technological University
id sg-ntu-dr.10356-162632
record_format dspace
spelling sg-ntu-dr.10356-162632 2022-11-02T00:45:08Z Learning to compose and reason with language tree structures for visual grounding Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang School of Computer Science and Engineering School of Electrical and Electronic Engineering Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained language compositions. However, existing solutions merely rely on the association between the holistic language features and visual features, while neglecting the nature of composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning. Nanyang Technological University This work was supported by the National Key Research and Development Program under Grant 2017YFB1002203, the National Natural Science Foundation of China under Grants 61722204 and 61732007, and Alibaba-NTU Singapore Joint Research Institute.
2022-11-02T00:45:07Z 2022-11-02T00:45:07Z 2019 Journal Article Hong, R., Liu, D., Mo, X., He, X. & Zhang, H. (2019). Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions On Pattern Analysis and Machine Intelligence, 44(2), 684-696. https://dx.doi.org/10.1109/TPAMI.2019.2911066 0162-8828 https://hdl.handle.net/10356/162632 10.1109/TPAMI.2019.2911066 30990419 2-s2.0-85122835371 2 44 684 696 en IEEE Transactions on Pattern Analysis and Machine Intelligence © 2019 IEEE. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering
Fine-Grained Detection
Tree Structure
spellingShingle Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering
Fine-Grained Detection
Tree Structure
Hong, Richang
Liu, Daqing
Mo, Xiaoyu
He, Xiangnan
Zhang, Hanwang
Learning to compose and reason with language tree structures for visual grounding
description Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained language compositions. However, existing solutions merely rely on the association between the holistic language features and visual features, while neglecting the nature of composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
author2 School of Computer Science and Engineering
author_facet School of Computer Science and Engineering
Hong, Richang
Liu, Daqing
Mo, Xiaoyu
He, Xiangnan
Zhang, Hanwang
format Article
author Hong, Richang
Liu, Daqing
Mo, Xiaoyu
He, Xiangnan
Zhang, Hanwang
author_sort Hong, Richang
title Learning to compose and reason with language tree structures for visual grounding
title_short Learning to compose and reason with language tree structures for visual grounding
title_full Learning to compose and reason with language tree structures for visual grounding
title_fullStr Learning to compose and reason with language tree structures for visual grounding
title_full_unstemmed Learning to compose and reason with language tree structures for visual grounding
title_sort learning to compose and reason with language tree structures for visual grounding
publishDate 2022
url https://hdl.handle.net/10356/162632
_version_ 1749179140286709760
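
Illustrative sketch (not part of the catalogued record): the description above mentions composing a binary tree over the expression and accumulating grounding scores bottom-up. The Python below is a toy illustration of that idea only, under loudly stated assumptions: the names Node, gumbel_argmax, compose_tree and grounding_score, the merge_logit and word_score callbacks, and the plain additive accumulation are all hypothetical and are not taken from the paper. Only the forward (sampling) half of a Straight-Through Gumbel-Softmax step is shown; the gradient path would need an autodiff framework.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """A binary tree node; a leaf covers one word, an inner node a span."""
    words: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def gumbel_argmax(logits: Sequence[float], rng: random.Random) -> int:
    """Pick an index by adding Gumbel noise to each logit and taking argmax.

    This is the discrete forward choice only. A straight-through estimator
    would additionally route gradients through a softmax of the noisy
    logits, which is not modelled here.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # keep u inside (0, 1)
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


def compose_tree(
    words: Sequence[str],
    merge_logit: Callable[[Node, Node], float],  # hypothetical scorer
    rng: random.Random,
) -> Node:
    """Greedily merge adjacent nodes until a single binary tree remains."""
    nodes = [Node([w]) for w in words]
    while len(nodes) > 1:
        logits = [merge_logit(a, b) for a, b in zip(nodes, nodes[1:])]
        i = gumbel_argmax(logits, rng)
        merged = Node(nodes[i].words + nodes[i + 1].words, nodes[i], nodes[i + 1])
        nodes[i:i + 2] = [merged]
    return nodes[0]


def grounding_score(
    node: Node,
    region: str,
    word_score: Callable[[str, str], float],  # hypothetical word-region score
) -> float:
    """Accumulate a region's score bottom-up: leaves score, parents add."""
    if node.left is None or node.right is None:
        return word_score(node.words[0], region)
    return (grounding_score(node.left, region, word_score)
            + grounding_score(node.right, region, word_score))


# Toy usage with made-up scores for two candidate regions.
toy_table = {
    ("black", "r_dog"): 1.0,
    ("dog", "r_dog"): 2.0,
    ("left", "r_dog"): 0.5,
    ("tree", "r_tree"): 2.0,
}
toy_words = ["black", "dog", "left", "tree"]
toy_tree = compose_tree(
    toy_words,
    lambda a, b: -float(len(a.words) + len(b.words)),  # prefer short spans
    random.Random(0),
)
best_region = max(
    ["r_dog", "r_tree"],
    key=lambda r: grounding_score(toy_tree, r, lambda w, reg: toy_table.get((w, reg), 0.0)),
)
```

With plain addition the total does not depend on the tree shape, which keeps the toy easy to check by hand. A learned model would presumably use node-dependent score functions, so that the sampled structure affects the result.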