Language-guided visual retrieval
Main Author: | He, Su |
---|---|
Other Authors: | Lin Guosheng (School of Computer Science and Engineering) |
Format: | Thesis-Master by Research |
Degree: | Master of Engineering |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision |
Online Access: | https://hdl.handle.net/10356/151040 |
DOI: | 10.32657/10356/151040 |
License: | Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
Institution: | Nanyang Technological University |
Description:
Language-guided Visual Retrieval (LGVR) is an important direction of cross-modality learning. It aims to retrieve or localize target content from untrimmed visual data under the guidance of a linguistic description. In this thesis we study two popular sub-tasks of LGVR: Visual Grounding (VG), which aims to locate an object in an image, and Natural Language Video Localization (NLVL), which aims to locate a target video clip within a long video.
For VG, we propose a novel modular network that learns to match both an object's symbolic features and its CNN-extracted visual features against the linguistic information, achieving better cross-modality alignment. In addition, a residual attention parser is introduced to address the difficulty of understanding language expressions.
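As a rough illustration of the matching idea (a minimal sketch, not the thesis's exact architecture; all module names and dimensions below are assumptions), the visual and symbolic features of each candidate object can be projected into a joint space shared with the language embedding and scored by similarity:

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatcher(nn.Module):
    """Scores candidate objects against a language expression by matching
    both visual (CNN) and symbolic (e.g. class-label embedding) features
    in a joint embedding space."""

    def __init__(self, vis_dim=2048, sym_dim=300, lang_dim=768, joint_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)    # CNN region features
        self.sym_proj = nn.Linear(sym_dim, joint_dim)    # symbolic features
        self.lang_proj = nn.Linear(lang_dim, joint_dim)  # expression embedding

    def forward(self, vis_feats, sym_feats, lang_feat):
        # vis_feats: (num_objects, vis_dim); sym_feats: (num_objects, sym_dim)
        # lang_feat: (lang_dim,) pooled embedding of the expression
        v = F.normalize(self.vis_proj(vis_feats), dim=-1)
        s = F.normalize(self.sym_proj(sym_feats), dim=-1)
        l = F.normalize(self.lang_proj(lang_feat), dim=-1)
        # Cosine similarity of each candidate to the expression in each
        # modality; the sum is the per-object grounding score.
        return v @ l + s @ l
```

Summing the two per-modality similarities lets the symbolic cue (e.g. a class label) compensate when the visual feature alone is ambiguous, which is one plausible reading of the dual matching described above.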
For NLVL, we utilize fine-grained semantic features of sparsely sampled video frames. To organize these discrete features, we propose a Hybrid Graph Network that captures both spatial and locally temporal relationships between objects in the frames and also performs semantic matching between objects and words. To model long-span relationships between activities in the two modalities, we implement a temporal encoder based on the attention mechanism. Finally, we formulate the prediction as a binary classification task rather than regressing the exact boundaries.
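To make the classification formulation concrete, here is a hedged sketch (hypothetical names; the per-frame labeling and the localization heuristic are assumptions, not the thesis's exact method): each temporal unit receives an inside/outside probability, and the clip span is read off the scores rather than regressed directly.

```python
import torch
import torch.nn as nn

class SpanClassifierHead(nn.Module):
    """Classifies each temporal unit as inside or outside the target clip."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.cls = nn.Linear(hidden_dim, 1)  # one inside/outside logit per frame

    def forward(self, frame_feats):
        # frame_feats: (num_frames, hidden_dim) fused video-language features,
        # e.g. the output of the temporal encoder described above.
        return self.cls(frame_feats).squeeze(-1)  # (num_frames,) logits

def localize(logits, threshold=0.5):
    """Pick the highest-scoring contiguous run of positive frames as the
    predicted (start, end) clip, instead of regressing boundary values."""
    probs = torch.sigmoid(logits)
    mask = (probs > threshold).tolist() + [False]  # sentinel flushes last run
    best, best_score, start = None, float("-inf"), None
    for i, inside in enumerate(mask):
        if inside and start is None:
            start = i                               # open a new run
        elif not inside and start is not None:
            score = probs[start:i].sum().item()     # close run [start, i-1]
            if score > best_score:
                best, best_score = (start, i - 1), score
            start = None
    return best  # None if no frame crosses the threshold
```

Under this formulation, training would use a binary cross-entropy loss (e.g. nn.BCEWithLogitsLoss) against 0/1 frame labels derived from the annotated clip, rather than a regression loss on boundary coordinates.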
We conduct extensive experiments on popular datasets for the two tasks to validate the effectiveness of the proposed models.