Win: Weight-decay-integrated Nesterov acceleration for faster network training
Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of "how to accelerate adaptive gradient algorithms in a general manner" and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, at each iteration we construct a dynamical loss that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of this dynamical loss to update the variable. This yields the Win acceleration, which uses a conservative step and an aggressive step per update and linearly combines the two updates for acceleration. Next, we extend Win into Win2, which uses multiple aggressive update steps for faster convergence, and then apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. We also theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam, and LAMB relative to their non-accelerated counterparts. Experimental results demonstrate the faster convergence and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB, and SGD over their vanilla counterparts on vision classification and language modeling tasks.
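To make the abstract's "conservative step plus aggressive step, linearly combined" description concrete, here is a minimal NumPy sketch of a Win-style wrapper around an AdamW-type update. It is an illustration only, not the paper's algorithm: the function name win_adamw_sketch, the step sizes lr and alpha, the mixing coefficient gamma, and the placement of the decoupled weight decay are assumptions and do not reproduce the paper's exact coefficients or convergence-guaranteeing schedules.

```python
import numpy as np

def win_adamw_sketch(grad_fn, x0, steps=100, lr=1e-3, alpha=3e-3, gamma=0.9,
                     beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """Hypothetical Win-style wrapper around an AdamW-type update.

    Per iteration: form Adam-style moment estimates at the mixed iterate y,
    take a conservative step (small step size lr) and an aggressive step
    (larger step size alpha, tracked by a separate sequence x), each with
    decoupled weight decay, then linearly combine the two updates.
    """
    x = x0.astype(float).copy()   # aggressive-step sequence
    y = x0.astype(float).copy()   # conservative / mixed sequence
    m = np.zeros_like(y)          # first moment estimate
    v = np.zeros_like(y)          # second moment estimate
    for k in range(1, steps + 1):
        g = grad_fn(y)                            # gradient at the mixed iterate
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)              # bias-corrected moments
        v_hat = v / (1 - beta2 ** k)
        update = m_hat / (np.sqrt(v_hat) + eps)   # Adam-style direction
        # Conservative step: small step size with decoupled weight decay.
        y_cons = (y - lr * update) / (1 + lr * weight_decay)
        # Aggressive step: larger step size on the separate x-sequence.
        x = (x - alpha * update) / (1 + alpha * weight_decay)
        # Linear combination of the two updates.
        y = gamma * y_cons + (1 - gamma) * x
    return y

# Usage sketch: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = win_adamw_sketch(grad_fn=lambda w: w, x0=np.ones(10), steps=500)
print(np.linalg.norm(w))  # the norm should shrink toward zero
```

The two sequences mirror the abstract's two Taylor approximations: the conservative step plays the role of the first-order (proximal, weight-decayed) update, the aggressive step uses a larger step size, and the final iterate is their convex combination.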
Main Authors: ZHOU, Pan; XIE, Xingyu; LIN, Zhouchen; TOH, Kim-Chuan; YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Accelerated Adaptive Gradient Algorithms; Deep Learning Optimizer; Network Optimization; Nesterov Acceleration in Deep Learning; OS and Networks; Theory and Algorithms
Online Access: https://ink.library.smu.edu.sg/sis_research/8969
https://ink.library.smu.edu.sg/context/sis_research/article/9972/viewcontent/2024JMLR.pdf
Institution: Singapore Management University
Record ID: sg-smu-ink.sis_research-9972
Collection: Research Collection School Of Computing and Information Systems (InK@SMU, SMU Libraries)
Publish Date: 2024-03-01
License: http://creativecommons.org/licenses/by-nc-nd/4.0/