Learning to compose and reason with language tree structures for visual grounding
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained language compositions. However, existing solutions merely rely on the association between the...
Main Authors: | Hong, Richang; Liu, Daqing; Mo, Xiaoyu; He, Xiangnan; Zhang, Hanwang |
---|---|
Other Authors: | |
Format: | Article |
Language: | English |
Published: | 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/162632 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-162632 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-162632 2022-11-02T00:45:08Z Learning to compose and reason with language tree structures for visual grounding Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang School of Computer Science and Engineering School of Electrical and Electronic Engineering Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions rely merely on the association between holistic language features and visual features, while neglecting the nature of composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning. Nanyang Technological University This work was supported by the National Key Research and Development Program under Grant 2017YFB1002203, the National Natural Science Foundation of China under Grants 61722204 and 61732007, and the Alibaba-NTU Singapore Joint Research Institute. 
2022-11-02T00:45:07Z 2022-11-02T00:45:07Z 2019 Journal Article Hong, R., Liu, D., Mo, X., He, X. & Zhang, H. (2019). Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions On Pattern Analysis and Machine Intelligence, 44(2), 684-696. https://dx.doi.org/10.1109/TPAMI.2019.2911066 0162-8828 https://hdl.handle.net/10356/162632 10.1109/TPAMI.2019.2911066 30990419 2-s2.0-85122835371 2 44 684 696 en IEEE Transactions on Pattern Analysis and Machine Intelligence © 2019 IEEE. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure |
spellingShingle |
Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang Learning to compose and reason with language tree structures for visual grounding |
description |
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions rely merely on the association between holistic language features and visual features, while neglecting the nature of composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning. |
author2 |
School of Computer Science and Engineering |
author_facet |
School of Computer Science and Engineering Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang |
format |
Article |
author |
Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang |
author_sort |
Hong, Richang |
title |
Learning to compose and reason with language tree structures for visual grounding |
title_short |
Learning to compose and reason with language tree structures for visual grounding |
title_full |
Learning to compose and reason with language tree structures for visual grounding |
title_fullStr |
Learning to compose and reason with language tree structures for visual grounding |
title_full_unstemmed |
Learning to compose and reason with language tree structures for visual grounding |
title_sort |
learning to compose and reason with language tree structures for visual grounding |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/162632 |
_version_ |
1749179140286709760 |
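
Illustrative sketch (not part of the catalog record): the description field above outlines two ideas, composing a binary tree through discrete choices made with a Gumbel-Softmax draw, and accumulating grounding scores bottom-up from the two sub-trees. The toy Python below shows the general shape of both under loud assumptions. The names `gumbel_softmax_hard`, `compose_tree`, `ground`, `merge_logit`, and `leaf_scores` are hypothetical, the leaf scores are made-up numbers for three image regions, and plain summation stands in for whatever score functions the paper actually uses. It is not the authors' implementation. Only the forward pass is shown; in an autodiff framework the straight-through trick would use the one-hot choice in the forward pass and the soft probabilities for gradients.

```python
import math
import random


def softmax(xs, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_hard(logits, rng, tau=1.0):
    """Return (one_hot, soft) for one Gumbel-Softmax draw (forward pass only)."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(logit - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
    soft = softmax(noisy, tau)
    k = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return one_hot, soft


def compose_tree(words, merge_logit, rng):
    """Merge one adjacent pair per step until a single binary tree remains."""
    nodes = list(words)
    while len(nodes) > 1:
        logits = [merge_logit(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        one_hot, _ = gumbel_softmax_hard(logits, rng)
        i = one_hot.index(1.0)
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]  # replace the pair by a parent
    return nodes[0]


def ground(tree, leaf_scores):
    """Bottom-up pass: a node's per-region score is the sum of its sub-tree scores."""
    if isinstance(tree, str):
        return leaf_scores[tree]
    left, right = tree
    return [a + b for a, b in zip(ground(left, leaf_scores), ground(right, leaf_scores))]


# Hypothetical inputs: four words, three candidate regions, uniform merge logits.
words = ["black", "dog", "left", "tree"]
leaf_scores = {
    "black": [0.5, 0.25, 0.0],
    "dog": [1.0, 0.75, 0.0],
    "left": [0.5, 0.0, 0.25],
    "tree": [0.0, 0.25, 1.0],
}
rng = random.Random(0)
tree = compose_tree(words, lambda a, b: 0.0, rng)
scores = ground(tree, leaf_scores)
best_region = max(range(len(scores)), key=scores.__getitem__)
```

Because this toy version combines sub-trees by plain addition, the final scores do not depend on the tree shape; the example only illustrates the recursion and the discrete merge choice, not node-specific scoring.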