Towards understanding convergence and generalization of AdamW

AdamW modifies Adam by adding a decoupled weight decay that decays the network weights at each training iteration. Unlike the widely used ℓ2-regularizer, which alters the optimization steps by changing the first- and second-order gradient moments, this decoupled weight decay does not affect the adaptive optimization steps themselves. Despite its great practical success, the convergence behavior of AdamW and its generalization improvement over Adam and ℓ2-regularized Adam (ℓ2-Adam) remain largely unexplained. To address this, we prove the convergence of AdamW and justify its generalization advantages over Adam and ℓ2-Adam. Specifically, AdamW provably converges but minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, thus yielding behaviors distinct from those of Adam and ℓ2-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and ℓ2-Adam and improves their previously known complexity, especially for over-parameterized networks. In addition, we prove that AdamW enjoys smaller generalization error than Adam and ℓ2-Adam from a Bayesian posterior perspective. This result, for the first time, explicitly reveals the benefit of decoupled weight decay in AdamW. Experimental results validate our theory.
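To make the distinction concrete, below is a minimal NumPy sketch (not the authors' code; hyperparameter defaults are illustrative) of a single parameter update for ℓ2-regularized Adam versus AdamW: in ℓ2-Adam the decay term is folded into the gradient and therefore enters both moment estimates, whereas AdamW applies the decay directly to the weights, leaving the adaptive moments untouched.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    """l2-Adam: weight decay is added to the gradient, so it enters both moments."""
    g = grad + wd * w                       # l2 term changes the effective gradient ...
    m = beta1 * m + (1 - beta1) * g         # ... and hence the first moment
    v = beta2 * v + (1 - beta2) * g * g     # ... and the second moment
    m_hat = m / (1 - beta1 ** t)            # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    """AdamW: the moments see only the raw gradient; decay acts directly on the weights."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decoupled weight decay
    return w, m, v
```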

Bibliographic Details
Main Authors: ZHOU, Pan, XIE, Xingyu, LIN, Zhouchen, YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024-03-01
DOI: 10.1109/TPAMI.2024.3382294
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Subjects: Analysis of AdamW; Convergence of AdamW; Generalization of AdamW; Adaptive gradient algorithms; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/8986
https://ink.library.smu.edu.sg/context/sis_research/article/9989/viewcontent/2023_TPAMI_AdamW_Analysis.pdf
Institution: Singapore Management University