Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection
Hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. Recent studies have tried to fine-tune pre-trained vision-language models (PVLMs) for this task. However, with increasing model sizes, it becomes important to leverage powerful PVLMs more efficiently, rather than simply fine-tuning them. Recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. This approach has shown good performance but suffers from non-informative image captions. Considering the two factors mentioned above, we propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions (which we call Pro-Cap), so that the captions contain information critical for hateful content detection. The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
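The probing step the abstract describes can be pictured with a short sketch. The snippet below is illustrative only, not the authors' released implementation: it assumes a BLIP-2-style frozen PVLM loaded through the Hugging Face transformers library, and the probing questions are hypothetical stand-ins for the hateful content-related questions used in the paper.

```python
# Minimal sketch of probing-based captioning (Pro-Cap), assuming a
# BLIP-2-style frozen PVLM queried through Hugging Face transformers.
# Model checkpoint and probing questions are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

MODEL_NAME = "Salesforce/blip2-flan-t5-xl"  # hypothetical checkpoint choice

processor = Blip2Processor.from_pretrained(MODEL_NAME)
model = Blip2ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()  # the PVLM stays frozen; no parameters are updated

# Hypothetical hateful content-related probing questions: one generic
# content question plus questions about attributes hateful memes
# commonly target.
PROBING_QUESTIONS = [
    "What is shown in the image?",
    "What is the race of the person in the image?",
    "What is the gender of the person in the image?",
    "What is the religion of the person in the image?",
]

@torch.no_grad()
def pro_cap(image: Image.Image) -> str:
    """Ask the frozen PVLM each probing question (zero-shot VQA) and
    join the answers into a probe-based caption (Pro-Cap)."""
    answers = []
    for question in PROBING_QUESTIONS:
        prompt = f"Question: {question} Answer:"  # standard BLIP-2 VQA prompt
        inputs = processor(images=image, text=prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=20)
        answer = processor.decode(output_ids[0], skip_special_tokens=True)
        answers.append(answer.strip())
    # The concatenated answers serve as the image caption fed, together
    # with the meme's overlaid text, to a downstream hateful meme classifier.
    return " ".join(a for a in answers if a)
```

Because the PVLM is only queried, never updated, this scheme sidesteps fine-tuning ever-larger models; the resulting Pro-Cap string is what the downstream detection model consumes alongside the meme text.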
Main Authors: CAO, Rui; HEE, Ming Shan; KUEK, Adriel; CHONG, Wen Haw; LEE, Roy Ka-Wei; JIANG, Jing
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Memes; multimodal; semantic extraction; Databases and Information Systems; Graphic Communications; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/8477
https://ink.library.smu.edu.sg/context/sis_research/article/9480/viewcontent/Pro_Cap_pvoa_cc_by.pdf
Institution: Singapore Management University
id
sg-smu-ink.sis_research-9480
record_format
dspace
last_indexed
2024-01-04T09:12:00Z
date_available
2023-11-01T07:00:00Z
doi
info:doi/10.1145/3581783.3612498
license
http://creativecommons.org/licenses/by/3.0/
source_collection
Research Collection School Of Computing and Information Systems
institution
Singapore Management University
building
SMU Libraries
continent
Asia
country
Singapore
content_provider
SMU Libraries
collection
InK@SMU
language
English
topic
Memes; multimodal; semantic extraction; Databases and Information Systems; Graphic Communications; Graphics and Human Computer Interfaces
format
text (application/pdf)
author
CAO, Rui; HEE, Ming Shan; KUEK, Adriel; CHONG, Wen Haw; LEE, Roy Ka-Wei; JIANG, Jing
author_sort
CAO, Rui
title
Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection
title_sort
pro-cap: leveraging a frozen vision-language model for hateful meme detection
publisher
Institutional Knowledge at Singapore Management University
publishDate
2023
url
https://ink.library.smu.edu.sg/sis_research/8477
https://ink.library.smu.edu.sg/context/sis_research/article/9480/viewcontent/Pro_Cap_pvoa_cc_by.pdf
_version_
1787590776733040640