Grounding referring expressions in images with neural module tree network
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2022
Subjects:
Online Access: https://hdl.handle.net/10356/156618
Institution: Nanyang Technological University
Summary: Grounding referring expressions in images, or visual grounding for short, is a task in
Artificial Intelligence (AI) that locates a target object in an image by interpreting a natural
language expression. The complex task of visual grounding requires composite visual reasoning to
better mimic the human logical thought process. However, existing methods do not address the
multiple components of natural language, over-simplifying the expression into either a monolithic
sentence embedding or a coarse subject-predicate-object composition. To engage more of the
complexity of natural language, a Neural Module Tree network (NMTree) is applied to the dependency
parsing tree of the referring expression during visual grounding. Each node of the dependency
parsing tree is treated as a neural module that calculates visual attention, and the grounding
score is accumulated in a bottom-up fashion to the root node of the tree. A Gumbel-Softmax
approximation is used to train the modules and their assembly end-to-end, reducing the impact of
parsing errors. NMTree decouples the composite reasoning from the visual grounding, providing a
more intuitive view of the localization process. The inclusion of NMTree provides a better
explanation of how natural language is grounded and outperforms state-of-the-art methods on
several benchmarks.
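To make the bottom-up scoring described in the summary concrete, the following is a minimal sketch of accumulating grounding attention over a dependency parse tree. It is not the thesis implementation: the TreeNode class, the dot-product scoring, and the element-wise merge of child attention are simplifying assumptions standing in for learned neural modules.

```python
# Minimal sketch (not the thesis implementation) of bottom-up grounding
# over a dependency parse tree: each node scores image regions, and
# child attention is merged upward toward the root.
import numpy as np

class TreeNode:
    def __init__(self, word, word_vec, children=None):
        self.word = word
        self.word_vec = word_vec      # word embedding (toy, random here)
        self.children = children or []

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ground(node, region_feats):
    """Return attention over image regions for the subtree at `node`.

    The dot-product score is a stand-in for a learned neural module;
    multiplying in each child's attention accumulates grounding
    evidence bottom-up, as the summary describes.
    """
    scores = region_feats @ node.word_vec        # (num_regions,)
    attn = softmax(scores)
    for child in node.children:
        attn = attn * ground(child, region_feats)  # merge child evidence
        attn = attn / attn.sum()                   # renormalize
    return attn

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # 5 candidate regions, 8-d features
tree = TreeNode("dog", rng.normal(size=8),
                [TreeNode("black", rng.normal(size=8))])
attention = ground(tree, regions)
print("grounded region:", attention.argmax(), attention.round(3))
```

Multiplying parent and child attention acts as a soft logical AND over regions; the actual model learns richer, module-specific merge functions rather than this fixed product.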
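The Gumbel-Softmax approximation mentioned in the summary is what keeps a discrete choice, such as which module a tree node becomes, differentiable so the assembly can be trained end-to-end. Below is a minimal sketch using PyTorch's gumbel_softmax primitive; the module inventory and the toy loss are hypothetical.

```python
# Minimal sketch of Gumbel-Softmax for a differentiable discrete choice.
# MODULES and the toy loss are hypothetical; F.gumbel_softmax is the
# standard PyTorch primitive.
import torch
import torch.nn.functional as F

MODULES = ["single", "sum", "comp"]   # hypothetical module inventory

logits = torch.randn(1, len(MODULES), requires_grad=True)

# hard=True yields a one-hot sample in the forward pass while gradients
# flow through the soft relaxation (straight-through estimator).
sample = F.gumbel_softmax(logits, tau=1.0, hard=True)
chosen = MODULES[sample.argmax(dim=-1).item()]
print("selected module:", chosen)

# A downstream loss still backpropagates into `logits` despite the
# discrete selection, so module assembly is trainable end-to-end.
loss = (sample * torch.arange(3.0)).sum()
loss.backward()
print("logit gradients:", logits.grad)
```

Because the selection is learned jointly with the grounding objective rather than fixed by the parser's output, errors in the initial dependency parse can be partially corrected during training, which is the "reducing parsing errors" benefit the summary refers to.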