Win: Weight-decay-integrated Nesterov acceleration for faster network training
Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of "how to accelerate adaptive gradient algorithms in a general manner" and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, at each iteration we construct a dynamical loss that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of this dynamical loss to update the variable. This yields the Win acceleration, which uses a conservative step and an aggressive step per update and linearly combines the two updates for acceleration. Next, we extend Win into Win2, which uses multiple aggressive update steps for faster convergence, and then apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. We also theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam, and LAMB relative to their non-accelerated counterparts. Experimental results demonstrate the faster convergence and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB, and SGD over their vanilla counterparts on vision classification and language modeling tasks.
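To make the abstract's "conservative step plus aggressive step, linearly combined" description concrete, here is a minimal NumPy sketch of a Win-style wrapper around an AdamW-type update. It is an illustration only, not the paper's algorithm: the function name win_adamw_sketch, the step sizes lr and alpha, the mixing coefficient gamma, and the placement of the decoupled weight decay are assumptions and do not reproduce the paper's exact coefficients or convergence-guaranteeing schedules.

```python
import numpy as np

def win_adamw_sketch(grad_fn, x0, steps=100, lr=1e-3, alpha=3e-3, gamma=0.9,
                     beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """Hypothetical Win-style wrapper around an AdamW-type update.

    Per iteration: form Adam-style moment estimates at the mixed iterate y,
    take a conservative step (small step size lr) and an aggressive step
    (larger step size alpha, tracked by a separate sequence x), each with
    decoupled weight decay, then linearly combine the two updates.
    """
    x = x0.astype(float).copy()   # aggressive-step sequence
    y = x0.astype(float).copy()   # conservative / mixed sequence
    m = np.zeros_like(y)          # first moment estimate
    v = np.zeros_like(y)          # second moment estimate
    for k in range(1, steps + 1):
        g = grad_fn(y)                            # gradient at the mixed iterate
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)              # bias-corrected moments
        v_hat = v / (1 - beta2 ** k)
        update = m_hat / (np.sqrt(v_hat) + eps)   # Adam-style direction
        # Conservative step: small step size with decoupled weight decay.
        y_cons = (y - lr * update) / (1 + lr * weight_decay)
        # Aggressive step: larger step size on the separate x-sequence.
        x = (x - alpha * update) / (1 + alpha * weight_decay)
        # Linear combination of the two updates.
        y = gamma * y_cons + (1 - gamma) * x
    return y

# Usage sketch: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = win_adamw_sketch(grad_fn=lambda w: w, x0=np.ones(10), steps=500)
print(np.linalg.norm(w))  # the norm should shrink toward zero
```

The two sequences mirror the abstract's two Taylor approximations: the conservative step plays the role of the first-order (proximal, weight-decayed) update, the aggressive step uses a larger step size, and the final iterate is their convex combination.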
Main Authors: ZHOU, Pan; XIE, Xingyu; LIN, Zhouchen; TOH, Kim-Chuan; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Accelerated Adaptive Gradient Algorithms; Deep Learning Optimizer; Network Optimization; Nesterov Acceleration in Deep Learning; OS and Networks; Theory and Algorithms
Online Access: https://ink.library.smu.edu.sg/sis_research/8969
https://ink.library.smu.edu.sg/context/sis_research/article/9972/viewcontent/2024JMLR.pdf
Institution: Singapore Management University
Record ID: sg-smu-ink.sis_research-9972
Collection: Research Collection School Of Computing and Information Systems (InK@SMU, SMU Libraries)
Publish Date: 2024-03-01
License: http://creativecommons.org/licenses/by-nc-nd/4.0/