Mathematical foundation of data science
Main Author: | Fang, Xiaowei |
---|---|
Other Authors: | Li, Yi (School of Physical and Mathematical Sciences) |
Format: | Final Year Project (FYP) |
Language: | English |
Published: | Nanyang Technological University, 2020 |
Degree: | Bachelor of Science in Mathematical Sciences |
Subjects: | Science::Mathematics |
Online Access: | https://hdl.handle.net/10356/139274 |
Institution: | Nanyang Technological University |
Description:

High-dimensional probability theory is of vital importance in the mathematical foundation of data science. This project involved a thorough reading of the recent monograph "High-Dimensional Probability: An Introduction with Applications in Data Science" by Roman Vershynin. The book integrates high-dimensional probability with applications in data science, bridging the gap between mathematical sophistication and the theoretical methods used in modern research. Emphasis is divided across three parts: concentration, stochastic processes, and random projections and sections. Chapters 1-6 form the backbone of the book. We first see concentration inequalities for random vectors, random matrices, and random projections, from which applications to semidefinite programming and the maximum cut problem for graphs are developed. We are then introduced to covering and packing arguments, which lead, via bounds on sub-gaussian random matrices, to applications in error-correcting codes, community detection in networks, covariance estimation, and clustering. Then follows the concentration of Lipschitz functions, which yields the Johnson-Lindenstrauss lemma, community detection in sparse networks, and covariance estimation for general distributions. We also learn the decoupling and symmetrization tricks, from which the application to matrix completion stems.

The second part develops bounds on the expected supremum of a random process, which pave the way for the last part of the book. The theoretical tools include comparison inequalities for Gaussian processes and the technique of Gaussian interpolation, which yield a bound on the operator norm of a Gaussian random matrix, a lower bound on the Gaussian width, and bounds on the diameter of random projections of sets. Later, the method of chaining and combinatorial reasoning based on the VC dimension enable us to bound processes with sub-gaussian increments and random quadratic forms, leading to two applications: empirical processes and statistical learning theory.

The last part of the book begins with a remarkably useful uniform deviation inequality for random matrices and random projections, whose consequences include several recoveries, by different methods, of results proved earlier, together with two new results: the M* bound and the Escape theorem. We then turn to applications in the recovery of sparse signals and low-rank matrices; of particular interest is the Lasso algorithm for sparse regression. Finally, equipped with the geometry of low-dimensional random projections, the book closes with a glimpse of Gaussian images of sets, projections of ellipsoids, and random projections in the Grassmannian.

This report gives a solution manual to nearly all the exercises in the book, based on the 24 May 2019 version of the electronic copy (Chapters 1-6) and the hard copy (Chapters 7-11). Each problem is stated in full before its solution, so the report is self-contained.
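As a small numerical companion to the Johnson-Lindenstrauss lemma mentioned in the first part, the sketch below (not part of the report; the constant 8 in the target dimension and all other parameters are illustrative choices) projects a cloud of points through a scaled Gaussian matrix and checks how far the pairwise distances move:

```python
import numpy as np

# Sketch of the Johnson-Lindenstrauss lemma: projecting N points from R^n
# down to m ~ log(N) / eps^2 dimensions with a scaled Gaussian matrix
# preserves all pairwise distances up to a factor (1 +/- eps) with high
# probability. The constant 8 below is an illustrative choice.
rng = np.random.default_rng(0)

n, N, eps = 1000, 50, 0.25                    # ambient dim, #points, distortion
m = int(np.ceil(8 * np.log(N) / eps**2))      # target dimension

X = rng.standard_normal((N, n))               # N points in R^n
P = rng.standard_normal((m, n)) / np.sqrt(m)  # scaled Gaussian projection
Y = X @ P.T                                   # projected points in R^m


def pairwise_dists(Z):
    """Euclidean distances between all distinct pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    i, j = np.triu_indices(len(Z), k=1)
    return np.sqrt((diff**2).sum(-1))[i, j]


ratios = pairwise_dists(Y) / pairwise_dists(X)
print(f"m = {m}, distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

With these parameters the ratios typically land well inside [1 - eps, 1 + eps], even though the target dimension m is chosen independently of the ambient dimension n.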
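The covariance estimation result from the same part (the sample covariance of m sub-gaussian samples in R^n deviates from the truth by roughly sqrt(n/m) in operator norm) can be probed the same way; the sizes below are arbitrary, and the sketch is not code from the report:

```python
import numpy as np

# Covariance estimation: with m samples from N(0, I_n), the sample
# covariance satisfies ||Sigma_hat - I|| on the order of sqrt(n/m)
# in operator norm once m >= n.
rng = np.random.default_rng(3)

n = 100
for m in [200, 800, 3200]:
    X = rng.standard_normal((m, n))          # m samples in R^n
    Sigma_hat = X.T @ X / m                  # sample covariance
    err = np.linalg.norm(Sigma_hat - np.eye(n), ord=2)
    print(f"m = {m:5d}: ||Sigma_hat - I|| = {err:.3f}, sqrt(n/m) = {np.sqrt(n/m):.3f}")
```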
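For the second part, the two-sided estimate on the operator norm of a Gaussian random matrix is equally easy to check empirically; the matrix sizes and trial count in this sketch are arbitrary:

```python
import numpy as np

# Empirical check for an m x n matrix A with i.i.d. N(0, 1) entries:
# max(sqrt(m), sqrt(n)) <~ E||A|| <= sqrt(m) + sqrt(n),
# where ||A|| is the operator (spectral) norm, i.e. the largest singular value.
rng = np.random.default_rng(1)

m, n, trials = 200, 300, 50
norms = [np.linalg.norm(rng.standard_normal((m, n)), ord=2) for _ in range(trials)]

print(f"mean ||A||        ~ {np.mean(norms):.1f}")
print(f"sqrt(m) + sqrt(n) = {np.sqrt(m) + np.sqrt(n):.1f}")
```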
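And for the third part, a minimal sparse-recovery experiment with the Lasso, here via scikit-learn; the dimensions, sparsity, noise level, and regularization weight alpha are illustrative assumptions rather than values taken from the report:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sparse recovery via the Lasso: recover an s-sparse vector x in R^n from
# m << n noisy linear measurements y = A x + noise, with a normalized
# Gaussian measurement matrix A.
rng = np.random.default_rng(2)

n, m, s = 500, 100, 5                          # ambient dim, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # measurement matrix
y = A @ x + 0.01 * rng.standard_normal(m)      # noisy measurements

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print(f"relative error: {np.linalg.norm(x_hat - x) / np.linalg.norm(x):.3f}")
print(f"nonzeros found: {np.count_nonzero(np.abs(x_hat) > 1e-3)} (true: {s})")
```

With on the order of s log n measurements, the estimator typically identifies the support and comes close in Euclidean norm, in line with the guarantees surveyed in the book.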