Mathematical foundation of data science

Bibliographic Details
Main Author: Fang, Xiaowei
Other Authors: Li, Yi
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/139274
Institution: Nanyang Technological University
Description
Summary: High-dimensional probability theory bears vital importance in the mathematical foundation of data science. This project involves thoroughly reading the recent monograph “High-Dimensional Probability: An Introduction with Applications in Data Science” by Roman Vershynin. The book integrates high-dimensional probability with applications in data science, bridging the gap between mathematical sophistication and the theoretical methods used in modern research. Emphasis is divided evenly across three parts: Concentration, Stochastic Processes, and Random Projections & Sections.

Chapters 1-6 form the backbone of the book. We first saw concentration inequalities involving random vectors, random matrices, and random projections, from which applications to semidefinite programming and maximum cut for graphs are developed. We were then introduced to covering and packing arguments, which prompt applications to error-correcting codes, community detection in networks, covariance estimation, and clustering via bounds on sub-gaussian random matrices. Then follows concentration of Lipschitz functions, which allows the establishment of the Johnson-Lindenstrauss lemma, community detection in sparse networks, and covariance estimation for general distributions. We also learned decoupling and symmetrization tricks, from which the application to matrix completion stems.

The second part delineates how to bound the expected supremum of a random process, which greases the wheels of the last part of the book. The theoretical tools include a collection of comparison inequalities for Gaussian processes and the technique of Gaussian interpolation, which help us ferret out a bound on the operator norm of a Gaussian random matrix and a lower bound on the Gaussian width, as well as bounds on the diameter of random projections of sets. Later, the method of chaining and combinatorial reasoning based on the VC dimension enable us to bound processes with sub-gaussian increments and random quadratic forms, leading to two applications in empirical processes and statistical learning theory.

The last part of the book commences with a remarkably useful uniform deviation inequality for random matrices and random projections, whose consequences include several recoveries of results proved earlier by different methods and two new results: the M* bound and the escape theorem. Thereafter, we immerse ourselves in the applications to recovery of sparse signals and low-rank matrices; of particular interest is the Lasso algorithm for sparse regression. Last but not least, equipped with the geometry of low-dimensional random projections, we wrapped up the book with a glimpse of Gaussian images of sets, projections of ellipsoids, and random projections in the Grassmannian.

This report gives a solution manual to nearly all the exercises in the book, based on the 24 May 2019 version of the electronic copy (Chapters 1-6) and the hard copy (Chapters 7-11). Each problem statement is included before its solution, so the report is self-contained.
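To make the first part's concentration phenomenon concrete, here is a minimal Python sketch (not part of the report; the dimensions and sample counts are arbitrary choices) checking that the norm of a standard Gaussian vector in R^n concentrates around sqrt(n) with fluctuations that stay O(1) as n grows:

```python
# Monte Carlo check: ||g||_2 concentrates around sqrt(n) for a standard
# Gaussian vector g in R^n, with standard deviation near 1/sqrt(2).
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    g = rng.standard_normal((5000, n))      # 5000 independent samples of g
    norms = np.linalg.norm(g, axis=1)       # ||g||_2 for each sample
    print(f"n={n:5d}  mean={norms.mean():8.3f}  "
          f"sqrt(n)={np.sqrt(n):8.3f}  std={norms.std():.3f}")
```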
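Covariance estimation, which recurs in the first part, can be illustrated the same way: for isotropic Gaussian data in R^n, the operator-norm error of the sample covariance should shrink roughly like sqrt(n/m) as the sample size m grows. Another hedged sketch with arbitrary sizes:

```python
# Sample-covariance error in operator norm for isotropic Gaussian data:
# ||Sigma_hat - I|| decays roughly like sqrt(n/m) as m grows.
import numpy as np

rng = np.random.default_rng(1)
n = 50
for m in (100, 1000, 10000):
    X = rng.standard_normal((m, n))         # m samples in R^n
    Sigma_hat = X.T @ X / m                 # sample covariance matrix
    err = np.linalg.norm(Sigma_hat - np.eye(n), ord=2)  # spectral norm
    print(f"m={m:6d}  error={err:.3f}  sqrt(n/m)={np.sqrt(n / m):.3f}")
```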
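The Johnson-Lindenstrauss lemma mentioned above is easy to see numerically: a Gaussian random projection scaled by 1/sqrt(k) nearly preserves all pairwise distances of a point cloud. A minimal sketch, where the point cloud and the target dimension k are illustrative assumptions:

```python
# Johnson-Lindenstrauss in action: project N points from R^d down to R^k
# with a scaled Gaussian matrix and measure the pairwise-distance distortion.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
N, d, k = 50, 2000, 300
X = rng.standard_normal((N, d))                 # N arbitrary points in R^d
P = rng.standard_normal((d, k)) / np.sqrt(k)    # random projection to R^k
ratio = pdist(X @ P) / pdist(X)                 # distance ratios, all pairs
print(f"distortion ratios in [{ratio.min():.3f}, {ratio.max():.3f}]")  # near 1
```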
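The Gaussian width w(T) = E sup_{t in T} <g, t> from the second part can be estimated by plain Monte Carlo. For the cube T = [-1,1]^n the supremum equals ||g||_1, whose expectation n*sqrt(2/pi) gives a closed form to check against; a small sketch:

```python
# Monte Carlo estimate of the Gaussian width of the cube [-1,1]^n.
# Here sup_{t in T} <g, t> = ||g||_1, with exact mean n * sqrt(2/pi).
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 100000
g = rng.standard_normal((trials, n))
estimate = np.abs(g).sum(axis=1).mean()     # average of sup over the cube
print(f"estimate={estimate:.2f}  exact={n * np.sqrt(2 / np.pi):.2f}")
```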
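Finally, the Lasso application from the last part: recovering a sparse signal from few noisy Gaussian measurements. The sketch below uses scikit-learn's Lasso as an off-the-shelf solver; the sizes, noise level, and penalty alpha are illustrative assumptions, not values from the report:

```python
# Lasso recovery of an s-sparse signal in R^n from m < n noisy
# linear measurements with a Gaussian measurement matrix.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, m, s = 200, 80, 5                            # dimension, measurements, sparsity
beta = np.zeros(n)
beta[rng.choice(n, s, replace=False)] = rng.standard_normal(s)  # sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)    # measurement matrix
y = A @ beta + 0.01 * rng.standard_normal(m)    # noisy measurements
fit = Lasso(alpha=0.01).fit(A, y)               # l1-penalized least squares
print(f"recovery error ||beta_hat - beta||_2 = "
      f"{np.linalg.norm(fit.coef_ - beta):.3f}")
```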