Towards understanding why mask reconstruction pretraining helps in downstream tasks

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g., MAE (He et al., 2021) and data2vec (Baevski et al., 2022), randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. For a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional “supervised learning” (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, because the pretraining dataset is of huge size and high diversity and thus covers most features in the downstream dataset, in the fine-tuning phase the pretrained encoder can capture as many features as possible in the downstream dataset and, with theoretical guarantees, would not lose these features. In contrast, SL only randomly captures some features, as suggested by the lottery ticket hypothesis. So MRP provably achieves better performance than SL on classification tasks. Experimental results testify to our data assumptions and to our theoretical implications.
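The abstract describes the core mechanism of MRP: randomly mask input patches, reconstruct the masked patches with an auto-encoder, then discard the decoder and fine-tune the encoder on a labeled downstream task. As a rough illustration only (not the paper's implementation), the minimal PyTorch-style sketch below shows that masking-and-reconstruction objective; the ToyEncoder/ToyDecoder modules, the 8x8 patch size, and the 0.75 mask ratio are assumed, illustrative choices.

# Minimal sketch of mask-reconstruction pretraining (MRP).
# Architecture sizes below are illustrative, not the paper's exact setting.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Two-layer convolutional encoder (echoing the two-layer encoder analyzed in the paper, in spirit only)."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """One-layer convolutional decoder mapping features back to pixels."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, z):
        return self.net(z)

def random_patch_mask(x, patch=8, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the masked input and the keep-mask (1 = visible)."""
    b, _, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=x.device) > mask_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask

def mrp_step(encoder, decoder, x, optimizer):
    """One pretraining step: reconstruct the hidden patches from the visible ones."""
    x_masked, mask = random_patch_mask(x)
    recon = decoder(encoder(x_masked))
    # Reconstruction loss is restricted to the masked (hidden) patches.
    loss = ((recon - x) ** 2 * (1.0 - mask)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    enc, dec = ToyEncoder(), ToyDecoder()
    opt = torch.optim.AdamW(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    x = torch.randn(4, 3, 32, 32)  # stand-in batch of images
    print(mrp_step(enc, dec, x, opt))
    # For a downstream task, the decoder is discarded and enc is fine-tuned with labels.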

Bibliographic Details
Main Authors: PAN, Jiachun; ZHOU, Pan; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/9022
https://ink.library.smu.edu.sg/context/sis_research/article/10025/viewcontent/2023_ICLR_MAE_Theory.pdf
Institution: Singapore Management University