Position-guided text prompt for vision-language pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability, which is critical for many downstream tasks such...

Bibliographic Details
Main Authors:	WANG, Alex Jinpeng; ZHOU, Pan; SHOU, Mike Zheng; YAN, Shuicheng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Graphics and Human Computer Interfaces; Programming Languages and Compilers
Online Access:	https://ink.library.smu.edu.sg/sis_research/9021 https://ink.library.smu.edu.sg/context/sis_research/article/10024/viewcontent/2023_CVPR_PTP.pdf
Institution:	Singapore Management University

Similar Items

Enhancing visual grounding in vision-language pre-training with position-guided text prompts
by: WANG, Alex Jinpeng, et al.
Published: (2024)

LPT: Long-tailed prompt tuning for image classification
by: DONG, Bowen, et al.
Published: (2023)

Let’s think outside the box: Exploring leap-of-thought in large language models with multimodal humor generation
by: ZHONG, Shanshan, et al.
Published: (2024)

CgT-GAN: CLIP-guided text GAN for image captioning
by: YU, Jiarui, et al.
Published: (2023)

MultiGPrompt for multi-task pre-training and prompting on graphs
by: YU, Xingtong, et al.
Published: (2024)

VLStereoSet: A study of stereotypical bias in pre-trained vision-language models
by: ZHOU, Kankan, et al.
Published: (2022)

Prompt for extraction? PAIE: Prompting Argument Interaction for Event Argument Extraction
by: MA, Yubo, et al.
Published: (2022)

Attack prompt generation for red teaming and defending large language models
by: DENG, Boyi, et al.
Published: (2023)

Wav-BERT: Cooperative acoustic and linguistic representation learning for low-resource speech recognition
by: ZHENG, Guolin, et al.
Published: (2021)

Using pre-trained models for vision-language understanding tasks
by: CAO, Rui
Published: (2024)

Aligning images in the wild
by: LIN, Wen-yan, et al.
Published: (2012)

Towards understanding why mask reconstruction pretraining helps in downstream tasks
by: PAN, Jiachun, et al.
Published: (2023)

Generalized graph prompt: Toward a unification of pre-training and downstream tasks on graphs
by: YU, Xingtong, et al.
Published: (2024)

HGPrompt: Bridging homogeneous and heterogeneous graphs for few-shot prompt learning
by: YU, Xingtong, et al.
Published: (2024)

Prompting for multimodal hateful meme classification
by: CAO, Rui, et al.
Published: (2022)

Compositional prompt tuning with motion cues for open-vocabulary video relation detection
by: GAO, Kaifeng, et al.
Published: (2023)

Consistent3D: Towards consistent high-fidelity text-to-3D generation with deterministic sampling prior
by: WU, Zike, et al.
Published: (2024)

Replay-and-forget-free graph class-incremental learning: A task profiling and prompting approach
by: NIU, Chaoxi, et al.
Published: (2024)

InceptionNeXt: When Inception meets ConvNeXt
by: YU, Weihao, et al.
Published: (2024)

Improving GAN training with probability ratio clipping and sample reweighting
by: WU, Yue, et al.
Published: (2020)

Augmenting low-resource text classification with graph-grounded pre-training and prompting
by: WEN, Zhihao, et al.
Published: (2023)

Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection
by: CAO, Rui, et al.
Published: (2023)

Self-supervised multi-class pre-training for unsupervised anomaly detection and segmentation in medical images
by: TIAN, Yu, et al.
Published: (2021)

Efficient meta learning via minibatch proximal update
by: ZHOU, Pan, et al.
Published: (2019)

EditAnything: Empowering unparalleled flexibility in image editing and generation
by: GAO, Shanghua, et al.
Published: (2023)

Do pre-trained models benefit knowledge graph completion? A reliable evaluation and a reasonable approach
by: LV, Xin, et al.
Published: (2022)

MetaFormer baselines for vision
by: YU, Weihao, et al.
Published: (2023)

Cookgan: Causality based text-to-image synthesis
by: ZHU, Bin, et al.
Published: (2020)

Cross-thought for sentence encoder pre-training
by: WANG, Shuohang, et al.
Published: (2020)

CAPGEN/Online: a fourth generation language
by: Ahmed, Iftikhar
Published: (1992)

Injecting descriptive meta-information into pre-trained language models with hypernetworks
by: DUAN, Wenying, et al.
Published: (2021)

Towards LLM-based fact verification on news claims with a hierarchical step-by-step prompting method
by: ZHANG, Xuan, et al.
Published: (2023)

How important is the train-validation split in meta-learning?
by: BAI, Yu, et al.
Published: (2021)

VIREO@TRECVID 2016: Multimedia event detection, ad-hoc video search, video to text description
by: ZHANG, Hao, et al.
Published: (2016)

Bridge text and knowledge by learning multi-prototype entity mention embedding
by: CAO, Yixin, et al.
Published: (2017)

Edgeduet: Tiling small object detection for edge assisted autonomous mobile vision
by: WANG, Xu, et al.
Published: (2021)

MetaFormer is actually what you need for vision
by: YU, Weihao, et al.
Published: (2022)

Delving into multimodal prompting for fine-grained visual classification
by: JIANG, Xin, et al.
Published: (2024)

Three dimensional graphics library
by: Lua, Moises, et al.
Published: (1990)

Task relation networks
by: LI, Jianshu, et al.
Published: (2019)