Unifying text, tables, and images for multimodal question answering

Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bi-modal QA models, which limits their ability to effectively integrate information across all modalities and leverage the power of pre-trained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies three different input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
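To illustrate the table-to-text step mentioned in the abstract, the sketch below shows one plausible way to linearize a table while keeping row and column positions recoverable by a text-to-text model. This is a minimal, assumption-laden example: the marker tokens ([TABLE], [COL c], [ROW r], [CELL r,c]) and the linearize_table helper are hypothetical illustrations and are not taken from the UniMMQA paper.

# Illustrative sketch (not the authors' implementation): flatten a table into a
# single string, tagging each cell with its row/column position so the layout
# can be recovered from plain text.
def linearize_table(header, rows):
    parts = ["[TABLE]"]
    # Emit the column names with their positions.
    for c, col_name in enumerate(header):
        parts.append(f"[COL {c}] {col_name}")
    # Emit each cell prefixed by its (row, column) coordinates and column name.
    for r, row in enumerate(rows):
        parts.append(f"[ROW {r}]")
        for c, cell in enumerate(row):
            parts.append(f"[CELL {r},{c}] {header[c]}: {cell}")
    return " ".join(parts)

if __name__ == "__main__":
    header = ["Country", "Capital"]
    rows = [["Singapore", "Singapore"], ["France", "Paris"]]
    print(linearize_table(header, rows))
    # -> "[TABLE] [COL 0] Country [COL 1] Capital [ROW 0] [CELL 0,0] Country: Singapore ..."

A string produced this way could then be concatenated with the question, supporting passages, and generated image captions before being fed to a text-to-text model, in line with the unified input format the abstract describes.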


Bibliographic Details
Main Authors: LUO, Haohao, SHEN, Ying, DENG, Yang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Cross-modal; Input modalities; Language model; Linearisation; Multi-modal; Power; Question Answering; Single-modal; Text format; Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access:https://ink.library.smu.edu.sg/sis_research/9120
https://ink.library.smu.edu.sg/context/sis_research/article/10123/viewcontent/2023.findings_emnlp.626.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
DOI: 10.18653/v1/2023.findings-emnlp.626
License: http://creativecommons.org/licenses/by-nc-nd/4.0/