Win: Weight-decay-integrated Nesterov acceleration for faster network training

Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of “how to accelerate adaptive gradient algorithms in a general manner”, and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, at each iteration we construct a dynamical loss that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method, and minimize the first- and second-order Taylor approximations of this dynamical loss to update the variable. This yields the Win acceleration, which uses a conservative step and an aggressive step per update and linearly combines the two updates for acceleration. We then extend Win into Win2, which uses multiple aggressive update steps for faster convergence, and apply Win and Win2 to the popular LAMB and SGD optimizers. The transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Moreover, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB over their non-accelerated counterparts. Experimental results demonstrate the faster convergence and superior performance of Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.
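To make the update pattern concrete, below is a minimal Python/NumPy sketch of one "conservative step plus aggressive step, linearly combined" update as described in the abstract, with weight decay folded into each step in a proximal (multiplicative) form. The base update here is a plain gradient step, and the hyperparameter names (lr_cons, lr_aggr, gamma, wd) are illustrative assumptions; the exact Win and Win2 recursions for AdamW, Adam, LAMB and SGD, including how the adaptive moments enter each step, are those given in the paper, not this sketch.

    import numpy as np

    def win_like_step(x, grad, lr_cons=0.01, lr_aggr=0.03, gamma=0.5, wd=1e-2):
        # Illustrative sketch only; hyperparameter names are assumptions.
        # Conservative step: small step size, weight decay applied proximally.
        y = (x - lr_cons * grad) / (1.0 + lr_cons * wd)
        # Aggressive step: larger step size, same proximal weight-decay form.
        z = (x - lr_aggr * grad) / (1.0 + lr_aggr * wd)
        # Linearly combine the two updates.
        return (1.0 - gamma) * y + gamma * z

    # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    x = np.array([5.0, -3.0])
    for _ in range(200):
        x = win_like_step(x, grad=x)
    print(x)  # moves toward the minimizer at the origin

Win2, as described above, would apply several aggressive steps rather than one before combining them with the conservative step.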

Bibliographic Details
Main Authors: ZHOU, Pan, XIE, Xingyu, LIN, Zhouchen, TOH, Kim-Chuan, YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Accelerated Adaptive Gradient Algorithms; Deep Learning Optimizer; Network Optimization; Nesterov Acceleration in Deep Learning; OS and Networks; Theory and Algorithms
Online Access:https://ink.library.smu.edu.sg/sis_research/8969
https://ink.library.smu.edu.sg/context/sis_research/article/9972/viewcontent/2024JMLR.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
License: http://creativecommons.org/licenses/by-nc-nd/4.0/