Language-guided visual retrieval
Language-guided Visual Retrieval (LGVR) is an important direction of cross-modality learning. It aims to retrieve or localize the objective message from the untrimmed visual information under the guidance of a linguistic description. In this thesis we study two popular sub-tasks of LGVR, one is V...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Master by Research |
Language: | English |
Published: |
Nanyang Technological University
2021
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/151040 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | Language-guided Visual Retrieval (LGVR) is an important direction of cross-modality
learning. It aims to retrieve or localize the objective message from the untrimmed visual
information under the guidance of a linguistic description. In this thesis we study two
popular sub-tasks of LGVR, one is Visual Grounding (VG) which aims to locate an object
in the image, and the other is Natural Language Video Localization (NLVL) which aims
to locate a targeted video clip from a long video span.
For VG, we propose a novel modular network learning to match both the object’s
symbolic feature and visual feature extracted by CNN with the linguistic information to
achieve a better cross-modality alignment. Besides, a residual attention parser is raised
to leverage the difficulty of understanding language expressions.
For NLVL, we utilize the fine-grained semantic features of the sparse frames in the
video. To organize the discrete features, we propose a network called Hybrid Graph
Network to capture both the spatial and locally temporal relationships between objects
in the frames and also apply semantically matching between objects and words. To
model the long-span relationships between activities in the two modalities, we implement
a temporal encoder based on the attentive model. Finally, we formulate the prediction
as a binary classification task rather than regressing the specific boundaries.
We conduct extensive experiments on popular datasets on the two tasks to validate
the effectiveness of our proposed models. |
---|