Alignment-enriched tuning for patch-level pre-trained document image models
Main Authors: | , , , , |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
Online Access: | https://ink.library.smu.edu.sg/sis_research/9318 https://ink.library.smu.edu.sg/context/sis_research/article/10318/viewcontent/Alignment_Enriched_pv.pdf |
Institution: | Singapore Management University |
Summary: | Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune pre-trained models adaptively for downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET. |
---|
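
The summary mentions a document-level cross-modal contrastive loss but does not give its form. The sketch below illustrates one common choice, a symmetric InfoNCE-style loss over paired image and text embeddings. It is an assumption for illustration, not the AETNet implementation: the function names, the temperature value of 0.07, and the use of plain Python lists are all hypothetical. The authors' actual code is in the repository linked in the record.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    image_embs[i] and text_embs[i] are treated as a matching pair; every
    other combination in the batch is treated as a negative.
    """
    img = [_normalize(v) for v in image_embs]
    txt = [_normalize(v) for v in text_embs]
    # Cosine similarity of every image against every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

With this sketch, a batch whose image and text embeddings line up pair by pair yields a loss near zero, while swapping the text order raises it sharply, which is the behaviour a document-level alignment objective is meant to reward.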