CgT-GAN: CLIP-guided text GAN for image captioning

The large-scale vision-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning in scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based captioning methods without human annotations follow a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or by the huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process so that the model can "see" the real visual modality. In particular, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus, and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded with a naturalness score from the GAN's discriminator, measuring how closely the caption resembles human language, and a semantic guidance reward computed by the CLIP-based reward module. In addition to cosine similarity as the semantic guidance reward (CLIP-cos), we further introduce a novel semantic guidance reward, CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.

Bibliographic Details
Main Authors: YU, Jiarui, LI, Haoran, HAO, Yanbin, ZHU, Bin, XU, Tong, HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects:
Image captioning
CLIP
Reinforcement learning
GAN
Graphics and Human Computer Interfaces
Programming Languages and Compilers
Online Access:https://ink.library.smu.edu.sg/sis_research/9012
https://ink.library.smu.edu.sg/context/sis_research/article/10015/viewcontent/CgT_GAN.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems
DOI: 10.1145/3581783.3611891
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
id sg-smu-ink.sis_research-10015
record_format dspace
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Image captioning
CLIP
Reinforcement learning
GAN
Graphics and Human Computer Interfaces
Programming Languages and Compilers
format text
author YU, Jiarui
LI, Haoran
HAO, Yanbin
ZHU, Bin
XU, Tong
HE, Xiangnan
author_sort YU, Jiarui
title CgT-GAN: CLIP-guided text GAN for image captioning
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/9012
https://ink.library.smu.edu.sg/context/sis_research/article/10015/viewcontent/CgT_GAN.pdf
_version_ 1814047692189335552