Balancing visual context understanding in dialogue for image retrieval
In the realm of dialogue-to-image retrieval, the primary challenge is to fetch images from a pre-compiled database that accurately reflect the intent embedded within the dialogue history. Existing methods often overemphasize inter-modal alignment, neglecting the nuanced nature of conversational cont...
Main Authors: | WEI, Zhaohui; LIAO, Lizi; DU, Xiaoyu; XIANG, Xinguang |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2024 |
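Subjects: | Dialogue-to-image retrieval; Dialogue history comprehension; Visual context understanding; Artificial Intelligence and Robotics; Computer Sciences |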
Online Access: | https://ink.library.smu.edu.sg/sis_research/9693 https://ink.library.smu.edu.sg/context/sis_research/article/10693/viewcontent/2024.findings_emnlp.465.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-10693 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-10693 2024-11-28T09:06:28Z 2024-11-01T07:00:00Z text application/pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Dialogue-to-image retrieval; Dialogue history comprehension; Visual context understanding; Artificial Intelligence and Robotics; Computer Sciences |
description |
In the realm of dialogue-to-image retrieval, the primary challenge is to fetch images from a pre-compiled database that accurately reflect the intent embedded within the dialogue history. Existing methods often overemphasize inter-modal alignment, neglecting the nuanced nature of conversational context. Dialogue histories are frequently cluttered with redundant information and often lack direct image descriptions, leading to a substantial disconnect between conversational content and visual representation. This study introduces VCU, a novel framework designed to enhance the comprehension of dialogue history and improve cross-modal matching for image retrieval. VCU leverages large language models (LLMs) to perform a two-step extraction process. It generates precise image-related descriptions from dialogues, while also enhancing visual representation by utilizing object-list texts associated with images. Additionally, auxiliary query collections are constructed to balance the matching process, thereby reducing bias in similarity computations. Experimental results demonstrate that VCU significantly outperforms baseline methods in dialogue-to-image retrieval tasks, highlighting its potential for practical application and effectiveness in bridging the gap between dialogue context and visual content. |
format |
text |
author |
WEI, Zhaohui; LIAO, Lizi; DU, Xiaoyu; XIANG, Xinguang |
author_sort |
WEI, Zhaohui |
title |
Balancing visual context understanding in dialogue for image retrieval |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2024 |
url |
https://ink.library.smu.edu.sg/sis_research/9693 https://ink.library.smu.edu.sg/context/sis_research/article/10693/viewcontent/2024.findings_emnlp.465.pdf |
_version_ |
1819113104810704896 |