Alignment-enriched tuning for patch-level pre-trained document image models

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally aris...

Bibliographic Details
Main Authors:	WANG, Lei; HE, Jiabang; XU, Xing; LIU, Ning; LIU, Hui
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Artificial Intelligence and Robotics; Databases and Information Systems
Online Access:	https://ink.library.smu.edu.sg/sis_research/9318 https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-10318
record_format	dspace
spelling	sg-smu-ink.sis_research-10318 2024-09-26T07:54:56Z
2023-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9318 https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Artificial Intelligence and Robotics Databases and Information Systems
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Artificial Intelligence and Robotics; Databases and Information Systems
description	Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the pre-trained models adaptive to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt downstream tasks with the joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments on various downstream tasks show that AETNet can achieve state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
format	text
author	WANG, Lei; HE, Jiabang; XU, Xing; LIU, Ning; LIU, Hui
author_sort	WANG, Lei
title	Alignment-enriched tuning for patch-level pre-trained document image models
title_sort	alignment-enriched tuning for patch-level pre-trained document image models
publisher	Institutional Knowledge at Singapore Management University
publishDate	2023
url	https://ink.library.smu.edu.sg/sis_research/9318 https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf
_version_	1814047908298752000
