Win: Weight-decay-integrated Nesterov acceleration for adaptive gradient algorithms

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of “how to accelerate adaptive gradient algorithms in a general manner”, and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Wei...

Bibliographic Details
Main Authors:	ZHOU, Pan; XIE, Xingyu; YAN, Shuicheng
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2023
Subjects:	Network optimizers; Deep learning optimizer; Deep learning algorithm; Optimization acceleration in deep learning; Deep Learning and representational learning; OS and Networks; Theory and Algorithms
Online Access:	https://ink.library.smu.edu.sg/sis_research/9056 https://ink.library.smu.edu.sg/context/sis_research/article/10059/viewcontent/653_win_weight_decay_integrated_ICLR.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-10059
record_format	dspace
spelling	sg-smu-ink.sis_research-10059 2024-08-01T15:36:45Z 2023-05-01T07:00:00Z text application/pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Network optimizers; Deep learning optimizer; Deep learning algorithm; Optimization acceleration in deep learning; Deep Learning and representational learning; OS and Networks; Theory and Algorithms
description	Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of “how to accelerate adaptive gradient algorithms in a general manner”, and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which performs a conservative step and a reckless step and then linearly combines the two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. In addition, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts, taking AdamW and Adam as examples. Experimental results confirm the faster convergence and superior performance of Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification and language modeling tasks with both CNN and Transformer backbones. We hope Win will become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.
format	text
author	ZHOU, Pan; XIE, Xingyu; YAN, Shuicheng
author_facet	ZHOU, Pan; XIE, Xingyu; YAN, Shuicheng
author_sort	ZHOU, Pan
title	Win: Weight-decay-integrated Nesterov acceleration for adaptive gradient algorithms
title_sort	win: weight-decay-integrated nesterov acceleration for adaptive gradient algorithms
publisher	Institutional Knowledge at Singapore Management University
publishDate	2023
url	https://ink.library.smu.edu.sg/sis_research/9056 https://ink.library.smu.edu.sg/context/sis_research/article/10059/viewcontent/653_win_weight_decay_integrated_ICLR.pdf
_version_	1814047719281393664

Similar Items