Learning to compose and reason with language tree structures for visual grounding

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained language compositions. However, existing solutions merely rely on the association between the...

Bibliographic Details
Main Authors:	Hong, Richang; Liu, Daqing; Mo, Xiaoyu; He, Xiangnan; Zhang, Hanwang
Other Authors:	School of Computer Science and Engineering
Format:	Article
Language:	English
Published:	2022
Subjects:	Engineering::Electrical and electronic engineering; Engineering::Computer science and engineering; Fine-Grained Detection; Tree Structure
Online Access:	https://hdl.handle.net/10356/162632

id	sg-ntu-dr.10356-162632
record_format	dspace
spelling	sg-ntu-dr.10356-162632 2022-11-02T00:45:08Z Learning to compose and reason with language tree structures for visual grounding Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang School of Computer Science and Engineering School of Electrical and Electronic Engineering Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions merely rely on the association between holistic language features and visual features, while neglecting the composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning. Nanyang Technological University This work was supported by the National Key Research and Development Program under Grant 2017YFB1002203, the National Natural Science Foundation of China under Grants 61722204 and 61732007, and the Alibaba-NTU Singapore Joint Research Institute.
2022-11-02T00:45:07Z 2022-11-02T00:45:07Z 2019 Journal Article Hong, R., Liu, D., Mo, X., He, X. & Zhang, H. (2019). Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions On Pattern Analysis and Machine Intelligence, 44(2), 684-696. https://dx.doi.org/10.1109/TPAMI.2019.2911066 0162-8828 https://hdl.handle.net/10356/162632 10.1109/TPAMI.2019.2911066 30990419 2-s2.0-85122835371 2 44 684 696 en IEEE Transactions on Pattern Analysis and Machine Intelligence © 2019 IEEE. All rights reserved.
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure
spellingShingle	Engineering::Electrical and electronic engineering Engineering::Computer science and engineering Fine-Grained Detection Tree Structure Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang Learning to compose and reason with language tree structures for visual grounding
description	Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending fine-grained language compositions. However, existing solutions merely rely on the association between holistic language features and visual features, while neglecting the composite reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RvG-Tree: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the two sub-trees. RvG-Tree can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.
author2	School of Computer Science and Engineering
author_facet	School of Computer Science and Engineering Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang
format	Article
author	Hong, Richang Liu, Daqing Mo, Xiaoyu He, Xiangnan Zhang, Hanwang
author_sort	Hong, Richang
title	Learning to compose and reason with language tree structures for visual grounding
title_short	Learning to compose and reason with language tree structures for visual grounding
title_full	Learning to compose and reason with language tree structures for visual grounding
title_fullStr	Learning to compose and reason with language tree structures for visual grounding
title_full_unstemmed	Learning to compose and reason with language tree structures for visual grounding
title_sort	learning to compose and reason with language tree structures for visual grounding
publishDate	2022
url	https://hdl.handle.net/10356/162632
_version_	1749179140286709760
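
The description in this record says that grounding scores are accumulated recursively from the two sub-trees of a binary parse tree, in a bottom-up pass. The following is a minimal sketch of that idea only, not the article's model: the names `Node`, `accumulate`, and `ground`, the hand-written tree, and the per-region scores are all hypothetical, and plain addition stands in for the learned score functions. Tree composition, which the article learns with a Straight-Through Gumbel-Softmax estimator, is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of a binary tree over a referring expression (hypothetical)."""
    phrase: str
    local_scores: List[float]  # one illustrative score per candidate region
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def accumulate(node: Node) -> List[float]:
    """Bottom-up pass: own scores plus the scores returned by both sub-trees.

    Addition is an assumption made for illustration; the article uses
    learned score functions instead.
    """
    total = list(node.local_scores)
    for child in (node.left, node.right):
        if child is not None:
            child_scores = accumulate(child)
            total = [t + c for t, c in zip(total, child_scores)]
    return total


def ground(root: Node) -> int:
    """Return the index of the candidate region with the highest score."""
    scores = accumulate(root)
    return max(range(len(scores)), key=scores.__getitem__)


# Hand-built tree for the example phrase, with three candidate regions.
root = Node(
    phrase="the black dog on the left of the tree",
    local_scores=[0.0, 0.0, 0.0],
    left=Node("the black dog", [2.0, 0.5, 0.0]),
    right=Node("on the left of the tree", [1.0, 1.5, 0.0]),
)

print(accumulate(root))  # accumulated score per region
print(ground(root))      # index of the selected region
```

Each sub-tree returns one score per candidate region, so the root can pick a region while every intermediate node still exposes the scores of its own phrase, which is what makes the reasoning inspectable.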
