Towards theoretically understanding why SGD generalizes better than ADAM in deep learning

It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings.

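The abstract's key empirical premise is that the stochastic gradient noise is heavy-tailed (better modeled by an α-stable distribution with tail index α < 2 than by a Gaussian), which is what justifies the Lévy-driven SDE analysis: heavier tails imply a shorter escaping time from a basin. The snippet below is a minimal, illustrative sketch of how one might check such a heavy-tail assumption on collected gradient-noise samples using the classical Hill estimator; it is not the authors' experimental code, and the function name and synthetic data are assumptions made here for illustration.

import numpy as np

def hill_tail_index(samples, k=500):
    # Hill estimator of the tail index alpha from the k largest absolute values.
    # Smaller alpha <=> heavier tails; alpha < 2 suggests non-Gaussian, heavy-tailed noise.
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]  # sort descending
    k = min(k, x.size - 1)
    top, threshold = x[:k], x[k]
    gamma = np.mean(np.log(top) - np.log(threshold))  # mean log-excess over the threshold
    return 1.0 / gamma                                # tail index alpha = 1 / gamma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heavy = rng.standard_t(df=1.5, size=50_000)  # heavy-tailed stand-in for gradient noise
    gauss = rng.standard_normal(size=50_000)     # Gaussian baseline for comparison
    print("heavy-tailed sample, estimated alpha:", hill_tail_index(heavy))
    print("Gaussian sample, estimated alpha:", hill_tail_index(gauss))

In practice the samples would be per-iteration differences between a minibatch gradient and a full (or large-batch) gradient collected during training; an estimate markedly smaller for the real noise than for a Gaussian baseline is consistent with the heavy-tailed assumption above.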

Saved in:
Bibliographic Details
Main Authors: ZHOU, Pan, FENG, Jiashi, MA, Chao, XIONG, Caiming, HOI, Steven C. H., E, Weinan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8999
https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10002
record_format dspace
spelling sg-smu-ink.sis_research-10002 2024-07-25T08:19:25Z Towards theoretically understanding why SGD generalizes better than ADAM in deep learning ZHOU, Pan FENG, Jiashi MA, Chao XIONG, Caiming HOI, Steven C. H. E, Weinan It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings. 2020-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8999 https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems OS and Networks
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
OS and Networks
spellingShingle Databases and Information Systems
OS and Networks
ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
description It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings.
format text
author ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
author_facet ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
author_sort ZHOU, Pan
title Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_short Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_full Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_fullStr Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_full_unstemmed Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_sort towards theoretically understanding why sgd generalizes better than adam in deep learning
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/sis_research/8999
https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf
_version_ 1814047688118763520