Towards theoretically understanding why SGD generalizes better than ADAM in deep learning

It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings.

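The abstract's key empirical premise is that the stochastic gradient noise is heavy-tailed (better modeled by an α-stable distribution with tail index α < 2 than by a Gaussian), which is what justifies the Lévy-driven SDE analysis: heavier tails imply a shorter escaping time from a basin. The snippet below is a minimal, illustrative sketch of how one might check such a heavy-tail assumption on collected gradient-noise samples using the classical Hill estimator; it is not the authors' experimental code, and the function name and synthetic data are assumptions made here for illustration.

import numpy as np

def hill_tail_index(samples, k=500):
    # Hill estimator of the tail index alpha from the k largest absolute values.
    # Smaller alpha <=> heavier tails; alpha < 2 suggests non-Gaussian, heavy-tailed noise.
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]  # sort descending
    k = min(k, x.size - 1)
    top, threshold = x[:k], x[k]
    gamma = np.mean(np.log(top) - np.log(threshold))  # mean log-excess over the threshold
    return 1.0 / gamma                                # tail index alpha = 1 / gamma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heavy = rng.standard_t(df=1.5, size=50_000)  # heavy-tailed stand-in for gradient noise
    gauss = rng.standard_normal(size=50_000)     # Gaussian baseline for comparison
    print("heavy-tailed sample, estimated alpha:", hill_tail_index(heavy))
    print("Gaussian sample, estimated alpha:", hill_tail_index(gauss))

In practice the samples would be per-iteration differences between a minibatch gradient and a full (or large-batch) gradient collected during training; an estimate markedly smaller for the real noise than for a Gaussian baseline is consistent with the heavy-tailed assumption above.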

Saved in:
Bibliographic Details
Main Authors: ZHOU, Pan, FENG, Jiashi, MA, Chao, XIONG, Caiming, HOI, Steven C. H., E, Weinan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8999
https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10002
record_format dspace
spelling sg-smu-ink.sis_research-10002 2024-07-25T08:19:25Z Towards theoretically understanding why SGD generalizes better than ADAM in deep learning ZHOU, Pan FENG, Jiashi MA, Chao XIONG, Caiming HOI, Steven C. H. E, Weinan It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings. 2020-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8999 https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Databases and Information Systems OS and Networks
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Databases and Information Systems
OS and Networks
spellingShingle Databases and Information Systems
OS and Networks
ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
description It is not yet clear why ADAM-like adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of the two algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM smooths its gradient and leads to lighter gradient-noise tails than SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima here, which often refer to minima in flat or asymmetric basins/valleys, tend to generalize better than sharp ones [1, 2], our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and support our theoretical findings.
format text
author ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
author_facet ZHOU, Pan
FENG, Jiashi
MA, Chao
XIONG, Caiming
HOI, Steven C. H.
E, Weinan
author_sort ZHOU, Pan
title Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_short Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_full Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_fullStr Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_full_unstemmed Towards theoretically understanding why SGD generalizes better than ADAM in deep learning
title_sort towards theoretically understanding why sgd generalizes better than adam in deep learning
publisher Institutional Knowledge at Singapore Management University
publishDate 2020
url https://ink.library.smu.edu.sg/sis_research/8999
https://ink.library.smu.edu.sg/context/sis_research/article/10002/viewcontent/2020_NeurIPS_Adam_Analysis.pdf
_version_ 1814047688118763520