Unifying text, tables, and images for multimodal question answering
Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bi-modal QA models, which limits their ability to effectively integrate information across all modalities and to leverage the power of pre-trained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies the three input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
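The abstract describes UniMMQA only at a high level, so the sketch below illustrates what position-enhanced table linearization can look like in practice: each cell is rendered as text together with its row and column position so a text-to-text model can read the table alongside passages and image captions. The marker format and the helper `linearize_table` are illustrative assumptions, not the exact scheme used in the paper.

```python
# Minimal sketch of position-enhanced table linearization (assumed format,
# not the paper's exact specification): every cell is emitted with its
# header and row/column position so positional structure survives flattening.

def linearize_table(headers, rows):
    """Flatten a table into one string with row/column position markers."""
    parts = []
    for r, row in enumerate(rows, start=1):
        for c, (header, cell) in enumerate(zip(headers, row), start=1):
            parts.append(f"[row {r}][col {c}] {header}: {cell}")
    return " | ".join(parts)


if __name__ == "__main__":
    headers = ["Country", "Capital"]
    rows = [["Singapore", "Singapore"], ["Japan", "Tokyo"]]
    question = "What is the capital of Japan?"
    # The linearized table can be concatenated with the question, retrieved
    # passages, and image captions to form a single text-to-text model input.
    prompt = f"question: {question} table: {linearize_table(headers, rows)}"
    print(prompt)
```

In the full framework described by the abstract, such a linearized table would be combined with diversified image captions and the generated cross-modal rationale before being passed to a pre-trained text-to-text model.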
Main Authors: | LUO, Haohao; SHEN, Ying; DENG, Yang |
---|---|
Format: | text (application/pdf) |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Subjects: | Cross-modal; Input modalities; Language model; Linearisation; Multi-modal; Power; Question Answering; Single-modal; Text format; Databases and Information Systems; Graphics and Human Computer Interfaces |
DOI: | 10.18653/v1/2023.findings-emnlp.626 |
Online Access: | https://ink.library.smu.edu.sg/sis_research/9120 https://ink.library.smu.edu.sg/context/sis_research/article/10123/viewcontent/2023.findings_emnlp.626.pdf |
License: | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Collection: | Research Collection School Of Computing and Information Systems, Singapore Management University |
Record ID: | sg-smu-ink.sis_research-10123 |