Mathematical foundation of data science

High-dimensional probability theory bears vital importance in the mathematical foundation of data science. This project involves a thorough reading of the recent monograph "High-Dimensional Probability: An Introduction with Applications in Data Science" by Roman Vershynin. The book integrates high-dimensional probability with applications in data science, bridging the gap between mathematical sophistication and the theoretical methods used in modern research. Emphasis is divided evenly across three parts: Concentration, Stochastic Processes, and Random Projections & Sections.

Chapters 1–6 act as the backbone of the book. We first saw concentration inequalities involving random vectors, random matrices, and random projections, from which applications to semidefinite programming and maximum cut for graphs are developed. We were then introduced to covering and packing arguments, which prompt applications to error-correcting codes, community detection in networks, covariance estimation, and clustering via bounds on sub-gaussian random matrices. Then follows concentration of Lipschitz functions, which allows the establishment of the Johnson-Lindenstrauss lemma, community detection in sparse networks, and covariance estimation for general distributions. We also learned decoupling and symmetrization tricks, from which the application to matrix completion stems.

The second part delineates how to bound the expected supremum of random processes, which greases the wheels of the last part of the book. The theoretical tools include a collection of comparison inequalities for Gaussian processes and the technique of Gaussian interpolation, which help us derive a bound on the operator norm of a Gaussian random matrix, a lower bound on the Gaussian width, and bounds on the diameter of random projections of sets. Later, the method of chaining and combinatorial reasoning based on the VC dimension enable us to bound processes with sub-gaussian increments and random quadratic forms, leading to two applications in empirical processes and statistical learning theory.

The last part of the book commences with a remarkably useful uniform deviation inequality for random matrices and random projections, whose consequences include several recoveries of results proved earlier by different methods and two new results: the M* bound and the Escape theorem. Thereafter, we immerse ourselves in the application to recovery of sparse signals and low-rank matrices. Of particular interest is the Lasso algorithm for sparse regression. Last but not least, equipped with the geometry of low-dimensional random projections, we wrapped up the book with a glimpse of Gaussian images of sets, projections of ellipsoids, and random projections in the Grassmannian.

This report gives a solution manual to nearly all the exercises in the book, based on the 24 May 2019 version of the electronic copy (Chapters 1–6) and the hard copy (Chapters 7–11). The problems are self-contained, presented prior to each solution.

Bibliographic Details
Main Author: Fang, Xiaowei
Other Authors: Li, Yi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Subjects: Science::Mathematics
Online Access:https://hdl.handle.net/10356/139274
Institution: Nanyang Technological University
School: School of Physical and Mathematical Sciences
Supervisor: Li, Yi (yili@ntu.edu.sg)
Degree: Bachelor of Science in Mathematical Sciences
Deposited: 2020-05-18