Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection

Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering both factors, we propose a probing-based captioning approach that leverages PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM with hateful-content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
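A minimal sketch of the probing-based captioning idea follows, assuming a BLIP VQA checkpoint from the Hugging Face transformers library as the frozen PVLM; the checkpoint name, the probing questions, and the input file are illustrative stand-ins, not the authors' exact configuration.

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Frozen pre-trained vision-language model used in zero-shot VQA mode
# (illustrative checkpoint; the paper's exact PVLM may differ).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()  # the PVLM stays frozen; no fine-tuning takes place

# Hypothetical hateful-content-related probing questions.
PROBING_QUESTIONS = [
    "what is shown in the image?",
    "what is the race of the person in the image?",
    "what is the gender of the person in the image?",
    "what is the religion of the person in the image?",
]

def pro_cap(image):
    # Ask the frozen PVLM each probing question and join the answers
    # into a caption carrying information relevant to hatefulness.
    answers = []
    for question in PROBING_QUESTIONS:
        inputs = processor(image, question, return_tensors="pt")
        output_ids = model.generate(**inputs)
        answers.append(processor.decode(output_ids[0], skip_special_tokens=True))
    return ". ".join(answers)

meme = Image.open("meme.png").convert("RGB")  # hypothetical input file
caption = pro_cap(meme)
# The Pro-Cap caption would then be fed, together with the meme's overlaid
# text, to a language-model classifier for the hateful/non-hateful decision.
print(caption)

In this sketch the frozen model is queried once per probing question, and the concatenated answers form the Pro-Cap caption that a downstream classifier consumes alongside the meme's text.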

Bibliographic Details
Main Authors: CAO, Rui, HEE, Ming Shan, KUEK, Adriel, CHONG, Wen Haw, LEE, Roy Ka-Wei, JIANG, Jing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Memes; multimodal; semantic extraction; Databases and Information Systems; Graphic Communications; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/8477
https://ink.library.smu.edu.sg/context/sis_research/article/9480/viewcontent/Pro_Cap_pvoa_cc_by.pdf
DOI: 10.1145/3581783.3612498
License: http://creativecommons.org/licenses/by/3.0/
Institution: Singapore Management University