Delving into multimodal prompting for fine-grained visual classification

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancemen...

Bibliographic Details
Main Authors:	JIANG, Xin, TANG, Hao, GAO, Junyao, DU, Xiaoyu, HE, Shengfeng, LI, Zechao
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Fine-grained visual classification Categorization Multimodal prompts Optimization strategy Artificial Intelligence and Robotics Graphics and Human Computer Interfaces Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/8741 https://ink.library.smu.edu.sg/context/sis_research/article/9744/viewcontent/28034_Article_Text_32088_1_2_20240324.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9744
record_format	dspace
spelling	sg-smu-ink.sis_research-9744 2024-05-03T07:51:40Z Delving into multimodal prompting for fine-grained visual classification JIANG, Xin TANG, Hao GAO, Junyao DU, Xiaoyu HE, Shengfeng LI, Zechao Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
2024-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8741 info:doi/10.1609/aaai.v38i3.28034 https://ink.library.smu.edu.sg/context/sis_research/article/9744/viewcontent/28034_Article_Text_32088_1_2_20240324.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Fine-grained visual classification Categorization Multimodal prompts Optimization strategy Artificial Intelligence and Robotics Graphics and Human Computer Interfaces Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Fine-grained visual classification Categorization Multimodal prompts Optimization strategy Artificial Intelligence and Robotics Graphics and Human Computer Interfaces Software Engineering
description	Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
format	text
author	JIANG, Xin TANG, Hao GAO, Junyao DU, Xiaoyu HE, Shengfeng LI, Zechao
author_sort	JIANG, Xin
title	Delving into multimodal prompting for fine-grained visual classification
title_sort	delving into multimodal prompting for fine-grained visual classification
publisher	Institutional Knowledge at Singapore Management University
publishDate	2024
url	https://ink.library.smu.edu.sg/sis_research/8741 https://ink.library.smu.edu.sg/context/sis_research/article/9744/viewcontent/28034_Article_Text_32088_1_2_20240324.pdf
_version_	1814047498865475584

Similar Items