Position-guided text prompt for vision-language pre-training
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability that is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N × N blocks and identifies the objects in each block with an object detector widely used in VLP. It then reformulates the visual grounding task as a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling "[P]" or "[O]" in the PTP "The block [P] has a [O]". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT [16] baseline and COCO Captioning (+5.3 in CIDEr) for the state-of-the-art BLIP [19] baseline. Moreover, PTP achieves results comparable to object-detector-based methods [8, 23, 45] with much faster inference, since PTP discards the object detector at inference time while the latter cannot.
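To make the prompt construction concrete, below is a minimal sketch of how a position-guided text prompt could be generated from object-detector outputs. This is an illustration only, not the authors' released code: the function name, the row-major block indexing, and the centre-point assignment rule are assumptions for the example; the paper's actual formatting of "[P]" and "[O]" tokens may differ.

```python
# Illustrative sketch of PTP prompt construction (not the authors' implementation).
# Assumes detector outputs for one image: bounding boxes in pixel coordinates
# plus a class name per box.

from typing import List, Tuple


def build_ptp_prompts(
    boxes: List[Tuple[float, float, float, float]],  # (x1, y1, x2, y2) per object
    labels: List[str],                                # detected class name per box
    image_size: Tuple[int, int],                      # (width, height) in pixels
    n: int = 3,                                       # split the image into n x n blocks
) -> List[str]:
    """Assign each detected object to the block containing its box centre and
    render the position-guided prompt 'The block [P] has a [O]'."""
    width, height = image_size
    prompts = []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        # Block index of the box centre, flattened row-major into 0 .. n*n - 1.
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        col = min(int(cx / width * n), n - 1)
        row = min(int(cy / height * n), n - 1)
        block_id = row * n + col
        prompts.append(f"The block {block_id} has a {label}.")
    return prompts


# During pre-training, either the block id ("[P]") or the object word ("[O]")
# would be masked so the model fills in the blank; here we print full prompts.
if __name__ == "__main__":
    boxes = [(30, 40, 120, 160), (400, 300, 620, 470)]
    labels = ["dog", "bicycle"]
    print(build_ptp_prompts(boxes, labels, image_size=(640, 480)))
    # ['The block 0 has a dog.', 'The block 8 has a bicycle.']
```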
Main Authors: WANG, Alex Jinpeng; ZHOU, Pan; SHOU, Mike Zheng; YAN, Shuicheng
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Graphics and Human Computer Interfaces; Programming Languages and Compilers
DOI: 10.1109/CVPR52729.2023.02226
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems, InK@SMU
Online Access: https://ink.library.smu.edu.sg/sis_research/9021
https://ink.library.smu.edu.sg/context/sis_research/article/10024/viewcontent/2023_CVPR_PTP.pdf
Institution: Singapore Management University