Position-guided text prompt for vision-language pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N x N blocks and identifies the objects in each block with the object detector widely used in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling in “[P]” or “[O]” in the PTP “The block [P] has a [O]”. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning architectures and several benchmarks, e.g. zero-shot Flickr30K retrieval (+4.8 in average recall@1) for the ViLT [16] baseline and COCO captioning (+5.3 in CIDEr) for the SOTA BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference speed, since PTP discards its object detector at inference time while the latter cannot.
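
To make the PTP formulation above concrete, the following minimal Python sketch builds such prompts from generic detector output. It assumes detections are given as (label, x_center, y_center) tuples in pixel coordinates; the function name, the grid size N = 3, and the left-to-right block numbering are illustrative assumptions, not details taken from the paper.

# Illustrative sketch of position-guided text prompt (PTP) construction.
# Assumes detector outputs as (label, x_center, y_center) tuples in pixel
# coordinates; names, grid size, and block numbering are hypothetical.

def build_ptp_prompts(detections, image_w, image_h, n=3):
    """Assign each detected object to one of the N x N blocks and fill the
    template "The block [P] has a [O]" described in the abstract."""
    prompts = []
    for label, x_center, y_center in detections:
        # Block index counts left-to-right, top-to-bottom over the N x N grid.
        col = min(int(x_center / image_w * n), n - 1)
        row = min(int(y_center / image_h * n), n - 1)
        block_id = row * n + col
        prompts.append(f"The block {block_id} has a {label}")
    return prompts

if __name__ == "__main__":
    # Hypothetical detector output for a 640x480 image.
    detections = [("dog", 100.0, 400.0), ("frisbee", 520.0, 60.0)]
    for prompt in build_ptp_prompts(detections, image_w=640, image_h=480):
        print(prompt)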

Bibliographic Details
Main Authors: WANG, Alex Jinpeng; ZHOU, Pan; SHOU, Mike Zheng; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Publication Date: 2023-06-01
DOI: 10.1109/CVPR52729.2023.02226
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems, InK@SMU, SMU Libraries
Subjects: Graphics and Human Computer Interfaces; Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/sis_research/9021
https://ink.library.smu.edu.sg/context/sis_research/article/10024/viewcontent/2023_CVPR_PTP.pdf
Institution: Singapore Management University