Towards understanding why Lookahead generalizes better than SGD and beyond

To train networks, the lookahead algorithm [1] updates its fast weights k times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g. SGD, can serve as the inner-loop optimizer, and the resulting lookahead variant generally enjoys a remarkable test-performance improvement over the vanilla optimizer. However, a theoretical understanding of this improvement has been missing. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead with SGD as its inner-loop optimizer better balances the optimization error and the generalization error, and thus achieves a smaller excess risk error than vanilla SGD, on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed or proved to hold in neural networks. Moreover, we show that the stagewise optimization strategy [2], which decays the learning rate several times during training, also benefits lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm, which at each stage minimizes the sum of the vanilla objective and a local regularizer, and provably improves optimization and generalization over conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet testify to its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer.
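For readers who want the update rule spelled out, below is a minimal NumPy sketch of lookahead wrapped around SGD on a toy least-squares problem, with a simple stagewise learning-rate decay on top. The toy data, the function names (lookahead_sgd, stochastic_grad), and the hyperparameter values are illustrative assumptions and are not taken from the paper or the SLRLA code; only the fast/slow update structure follows the algorithm described above.

```python
# Illustrative sketch (not the authors' implementation): Lookahead around SGD
# on a toy least-squares problem, plus a simple stagewise learning-rate decay.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

def stochastic_grad(w, batch=16):
    # Least-squares gradient averaged over a random mini-batch.
    idx = rng.integers(0, A.shape[0], size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / batch

def lookahead_sgd(w0, k=5, alpha=0.5, lr=0.05, outer_steps=300):
    # Slow weights are updated once per outer step by interpolating toward
    # the fast weights produced by k inner-loop SGD steps.
    slow = w0.copy()
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                    # k fast (inner-loop) updates
            fast -= lr * stochastic_grad(fast)
        slow += alpha * (fast - slow)         # one slow update
    return slow

# Stagewise variant: rerun lookahead with a decayed learning rate each stage.
w = np.zeros(10)
for lr in [0.05, 0.01, 0.002]:
    w = lookahead_sgd(w, lr=lr)
print("final mean squared error:", np.mean((A @ w - b) ** 2))
```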


Bibliographic Details
Main Authors: ZHOU, Pan, YAN, Hanshu, YUAN, Xiaotong, FENG, Jiashi, YAN, Shuicheng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
Subjects: Theory and Algorithms
Online Access:https://ink.library.smu.edu.sg/sis_research/8987
https://ink.library.smu.edu.sg/context/sis_research/article/9990/viewcontent/2021_NeurIPS_lookahead.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9990
record_format dspace
spelling sg-smu-ink.sis_research-9990 2024-07-25T08:28:25Z Towards understanding why Lookahead generalizes better than SGD and beyond ZHOU, Pan YAN, Hanshu YUAN, Xiaotong FENG, Jiashi YAN, Shuicheng To train networks, the lookahead algorithm [1] updates its fast weights k times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g. SGD, can serve as the inner-loop optimizer, and the resulting lookahead variant generally enjoys a remarkable test-performance improvement over the vanilla optimizer. However, a theoretical understanding of this improvement has been missing. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead with SGD as its inner-loop optimizer better balances the optimization error and the generalization error, and thus achieves a smaller excess risk error than vanilla SGD, on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed or proved to hold in neural networks. Moreover, we show that the stagewise optimization strategy [2], which decays the learning rate several times during training, also benefits lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm, which at each stage minimizes the sum of the vanilla objective and a local regularizer, and provably improves optimization and generalization over conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet testify to its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer. 2021-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8987 https://ink.library.smu.edu.sg/context/sis_research/article/9990/viewcontent/2021_NeurIPS_lookahead.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Theory and Algorithms
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Theory and Algorithms
description To train networks, the lookahead algorithm [1] updates its fast weights k times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g. SGD, can serve as the inner-loop optimizer, and the resulting lookahead variant generally enjoys a remarkable test-performance improvement over the vanilla optimizer. However, a theoretical understanding of this improvement has been missing. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead with SGD as its inner-loop optimizer better balances the optimization error and the generalization error, and thus achieves a smaller excess risk error than vanilla SGD, on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed or proved to hold in neural networks. Moreover, we show that the stagewise optimization strategy [2], which decays the learning rate several times during training, also benefits lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm, which at each stage minimizes the sum of the vanilla objective and a local regularizer, and provably improves optimization and generalization over conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet testify to its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer.
format text
author ZHOU, Pan
YAN, Hanshu
YUAN, Xiaotong
FENG, Jiashi
YAN, Shuicheng
author_sort ZHOU, Pan
title Towards understanding why Lookahead generalizes better than SGD and beyond
publisher Institutional Knowledge at Singapore Management University
publishDate 2021
url https://ink.library.smu.edu.sg/sis_research/8987
https://ink.library.smu.edu.sg/context/sis_research/article/9990/viewcontent/2021_NeurIPS_lookahead.pdf