Alignment-enriched tuning for patch-level pre-trained document image models

Bibliographic Details
Main Authors: WANG, Lei, HE, Jiabang, XU, Xing, LIU, Ning, LIU, Hui
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Online Access:https://ink.library.smu.edu.sg/sis_research/9318
https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf
Institution: Singapore Management University
Description
Summary: Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: could we fine-tune pre-trained models to adapt to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
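
To make the joint objective concrete, below is a minimal PyTorch sketch of one plausible form of the document-level cross-modal term: a symmetric InfoNCE contrastive loss over paired image/text embeddings, added to the task-specific supervised loss. The function names, the loss weight lam, and the temperature tau are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, tau=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings.
    # tau is an assumed temperature hyperparameter.
    img = F.normalize(img_emb, dim=-1)           # (B, D)
    txt = F.normalize(txt_emb, dim=-1)           # (B, D)
    logits = img @ txt.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_objective(task_loss, img_emb, txt_emb, lam=0.1):
    # Task-specific supervised loss plus an alignment-aware
    # contrastive term; lam is an assumed trade-off weight.
    return task_loss + lam * cross_modal_contrastive_loss(img_emb, txt_emb)

# Usage with random tensors standing in for encoder outputs:
B, D = 8, 256
img_emb = torch.randn(B, D)    # alignment-aware image encoder output
txt_emb = torch.randn(B, D)    # alignment-aware text encoder output
task_loss = torch.tensor(1.0)  # e.g. token-classification loss from the fused model
print(joint_objective(task_loss, img_emb, txt_emb))

In this sketch each image is contrasted against all texts in the batch (and vice versa), which corresponds to the document-level cross-modal alignment the abstract describes; the intra-modal, global-local, and local-level (patch-level) terms would add analogous losses over other embedding pairs.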