Alignment-enriched tuning for patch-level pre-trained document image models

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally aris...

Bibliographic Details
Main Authors: WANG, Lei, HE, Jiabang, XU, Xing, LIU, Ning, LIU, Hui
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/9318
https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10318
record_format dspace
spelling sg-smu-ink.sis_research-10318 2024-09-26T07:54:56Z Alignment-enriched tuning for patch-level pre-trained document image models WANG, Lei HE, Jiabang XU, Xing LIU, Ning LIU, Hui Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the pre-trained models adaptive to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt downstream tasks with the joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments on various downstream tasks show that AETNet can achieve state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
2023-02-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9318 https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Artificial Intelligence and Robotics Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Artificial Intelligence and Robotics
Databases and Information Systems
description Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the pre-trained models adaptive to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt downstream tasks with the joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments on various downstream tasks show that AETNet can achieve state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
format text
author WANG, Lei
HE, Jiabang
XU, Xing
LIU, Ning
LIU, Hui
author_sort WANG, Lei
title Alignment-enriched tuning for patch-level pre-trained document image models
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/9318
https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf
_version_ 1814047908298752000
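
Illustrative note on the description field: the abstract mentions document-level alignment through a cross-modal contrastive loss but gives no formula. The sketch below shows one common form of such a loss (a symmetric InfoNCE-style objective over paired image and text embeddings). It is an assumption for illustration only, not the authors' implementation, which is in the repository cited in the record. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical sketch).

    image_embs[i] and text_embs[i] are assumed to come from the same
    document; every other pairing in the batch is treated as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = _diagonal_cross_entropy(logits)
    text_to_image = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch of two documents with 2-D embeddings (made-up values).
aligned = cross_modal_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
swapped = cross_modal_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
print(aligned < swapped)
```

The loss is small when each image embedding is closest to its own document's text embedding and large when the pairs are mismatched. In a joint objective of the kind the abstract describes, a term like this would be added to the task-specific supervised loss; how the two are weighted is not stated in this record.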