Fully decoupled neural network learning using delayed gradients

Bibliographic Details
Main Authors: Zhuang, Huiping, Wang, Yi, Liu, Qinglai, Lin, Zhiping
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2024
Online Access: https://hdl.handle.net/10356/174476
Institution: Nanyang Technological University
Description
Summary: Training neural networks with back-propagation (BP) requires a sequential passing of activations and gradients. These dependencies are known as the lockings (i.e., the forward, backward, and update lockings) among modules (each module containing a stack of layers) inherited from BP. In this paper, we propose a fully decoupled training scheme using delayed gradients (FDG) to break all of these lockings. The FDG splits a neural network into multiple modules and trains them independently and asynchronously on different workers (e.g., GPUs). We also introduce a gradient shrinking process to reduce the stale-gradient effect caused by the delayed gradients. Our theoretical proofs show that the FDG converges to critical points under certain conditions. Experiments are conducted by training deep convolutional neural networks to perform classification tasks on several benchmark datasets. These experiments show that our approach achieves comparable or better results than state-of-the-art methods in terms of generalization and acceleration. We also show that the FDG is able to train various networks, including extremely deep ones (e.g., ResNet-1202), in a decoupled fashion.
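
Illustrative sketch: The following is a minimal single-device simulation of the delayed-gradient idea described in the summary, not the authors' implementation. It splits a small network into two modules; module 1 is updated from a one-step-delayed gradient supplied by module 2, scaled by a shrinking factor. The layer sizes, the shrink factor of 0.5, and the synthetic data are illustrative assumptions, and a full FDG implementation would run each module asynchronously on its own worker (e.g., a separate GPU).

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two modules of a split network; sizes and the shrink factor are illustrative.
module1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
module2 = nn.Sequential(nn.Linear(32, 2))
opt1 = torch.optim.SGD(module1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(module2.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
shrink = 0.5         # hypothetical gradient-shrinking factor for stale gradients

stale_output = None  # module 1 output cached from the previous step
delayed_grad = None  # gradient of the loss w.r.t. that cached output

for step in range(100):
    x = torch.randn(16, 10)            # synthetic inputs
    y = torch.randint(0, 2, (16,))     # synthetic labels

    # Module 1 first consumes the delayed (one step old, shrunk) gradient ...
    if delayed_grad is not None:
        opt1.zero_grad()
        stale_output.backward(shrink * delayed_grad)
        opt1.step()

    # ... then runs its forward pass without waiting for module 2's backward.
    h = module1(x)
    stale_output = h

    # Module 2 trains on a detached copy, so the two modules are decoupled.
    h_in = h.detach().requires_grad_(True)
    loss = loss_fn(module2(h_in), y)
    opt2.zero_grad()
    loss.backward()
    opt2.step()
    delayed_grad = h_in.grad           # reaches module 1 at the next step

In this simplified loop the delay is exactly one step; in the paper's asynchronous setting the staleness of the gradients grows with the number of modules, which is what the gradient shrinking process is meant to counteract.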