Fully decoupled neural network learning using delayed gradients
| Field | Value |
|---|---|
| Main Authors | |
| Other Authors | |
| Format | Article |
| Language | English |
| Published | 2024 |
| Subjects | |
| Online Access | https://hdl.handle.net/10356/174476 |
| Institution | Nanyang Technological University |
Summary: Training neural networks with back-propagation (BP) requires a sequential passing of activations and gradients. This has been recognized as the lockings (i.e., the forward, backward, and update lockings) among modules (each module contains a stack of layers) inherited from BP. In this paper, we propose a fully decoupled training scheme using delayed gradients (FDG) to break all these lockings. The FDG splits a neural network into multiple modules and trains them independently and asynchronously using different workers (e.g., GPUs). We also introduce a gradient shrinking process to reduce the stale-gradient effect caused by the delayed gradients. Our theoretical proofs show that the FDG can converge to critical points under certain conditions. Experiments are conducted by training deep convolutional neural networks on classification tasks over several benchmark datasets; they show that our approach achieves comparable or better generalization and acceleration than state-of-the-art methods. We also show that the FDG is able to train various networks, including extremely deep ones (e.g., ResNet-1202), in a decoupled fashion.
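The abstract only sketches the mechanism, so the following is a minimal single-process PyTorch illustration of the general idea rather than the paper's implementation: a network is split into two modules, the second module updates immediately, and the first module is updated with a gradient that arrives one step late and is scaled by a shrinking factor. The two-layer toy network, the random data, the one-step delay, and the value of `beta` are all assumptions for illustration; the actual FDG runs modules asynchronously on separate workers (e.g., GPUs).

```python
# Minimal single-process sketch of decoupled training with delayed, shrunken
# gradients. Assumptions (not from the paper): a 2-module MLP, random toy data,
# a fixed one-step delay between modules, and a constant shrinking factor beta.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 10-d inputs, 2-class labels.
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))

# Split one network into two independently updated modules.
module1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
module2 = nn.Sequential(nn.Linear(32, 2))
opt1 = torch.optim.SGD(module1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(module2.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

beta = 0.5       # gradient-shrinking factor (illustrative value)
pending = None   # (stored module-1 output, shrunken delayed gradient)

for step in range(201):
    # 1) Apply the gradient that arrived one step late, shrunk by beta.
    if pending is not None:
        old_h, delayed_grad = pending
        opt1.zero_grad()
        old_h.backward(delayed_grad)   # back-propagate the stale gradient through module 1
        opt1.step()

    # 2) Module-1 forward pass; hand a detached copy to module 2.
    idx = torch.randint(0, X.size(0), (32,))
    xb, yb = X[idx], y[idx]
    h = module1(xb)
    h_in = h.detach().requires_grad_(True)

    # 3) Module 2 trains immediately on the received activations.
    loss = loss_fn(module2(h_in), yb)
    opt2.zero_grad()
    loss.backward()
    opt2.step()

    # 4) Queue module 2's input gradient; module 1 consumes it next step.
    pending = (h, beta * h_in.grad)

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}")
```

With more modules, earlier modules would see proportionally older gradients; applying the delayed update before each new forward pass keeps the stashed activations consistent with the weights that produced them in this single-worker simulation.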