Identifying correlated heavy-hitters

The heavy hitter problem asks for the top k most frequent elements in a data stream. It arises in many applications, including network data analysis and event mining. Many classical algorithms, such as Count-Sketch and Count-Min, can only handle one-dimensional data; in this study we focus on heavy hitters in two-dimensional data. Our goal is to identify the locations of the heavy hitters and to estimate their values. In the first part, we use a two-sided Count-Sketch to estimate the values of the heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify their locations. A two-sided Count-Sketch applies Count-Sketch twice. First, we apply Count-Sketch to the rows, hashing the n rows into Θ(k poly(log n)) buckets; with high probability the heavy rows land in distinct buckets, so their l2 norms are preserved. Next, we apply Count-Sketch to the columns, which lets us estimate the heavy entries within each row bucket. The resulting matrix has a much smaller dimension than the original. Identification of heavy hitters is built on Count-Sketch matrices and bit-testing matrices, and we further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows, which identifies the heavy rows.
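
For illustration, the following is a minimal Python sketch of the standard one-dimensional Count-Sketch, the building block that the two-sided construction above applies twice (first to the rows, then to the columns). The repetition count, bucket count, and hash functions are illustrative assumptions, not the parameters used in the project.

    # Minimal one-dimensional Count-Sketch (illustrative parameters only).
    import random
    from collections import defaultdict

    class CountSketch:
        def __init__(self, repetitions=5, buckets=256, seed=0):
            rng = random.Random(seed)
            # Each repetition gets its own bucket-hash seed and sign-hash seed.
            self.seeds = [(rng.random(), rng.random()) for _ in range(repetitions)]
            self.buckets = buckets
            self.tables = [defaultdict(float) for _ in range(repetitions)]

        def _bucket(self, seed, item):
            return hash((seed, item)) % self.buckets

        def _sign(self, seed, item):
            return 1 if hash((seed, item, "sign")) % 2 == 0 else -1

        def update(self, item, weight=1.0):
            for table, (hs, ss) in zip(self.tables, self.seeds):
                table[self._bucket(hs, item)] += self._sign(ss, item) * weight

        def estimate(self, item):
            # Median of the per-repetition unbiased estimates.
            ests = sorted(self._sign(ss, item) * table[self._bucket(hs, item)]
                          for table, (hs, ss) in zip(self.tables, self.seeds))
            return ests[len(ests) // 2]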

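Likewise, the bit-testing idea behind the identification step can be shown with a toy example: next to a total counter, keep one counter per bit position of the item index, and read off each bit of a dominant item by majority. This sketch ignores the Count-Sketch bucketing, the Johnson-Lindenstrauss norm estimation, and the error-correcting code that the project layers on top; the function name and parameters are hypothetical.

    # Toy bit-testing: recover the index of an item that dominates the stream.
    def identify_dominant_index(stream, index_bits=16):
        total = 0.0
        bit_counters = [0.0] * index_bits
        for index, weight in stream:          # stream of (index, weight) updates
            total += weight
            for b in range(index_bits):
                if (index >> b) & 1:
                    bit_counters[b] += weight
        # Bit b of the dominant index is 1 iff more than half of the total
        # mass carries that bit.
        return sum((1 << b) for b in range(index_bits) if bit_counters[b] > total / 2)

    # Example: index 42 dominates, so it is recovered despite the noise items.
    stream = [(42, 100.0), (7, 3.0), (13, 2.0)]
    print(identify_dominant_index(stream))    # prints 42
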
Bibliographic Details
Main Author: Zhou, Ziqi
Other Authors: Li Yi (School of Physical and Mathematical Sciences, yi_li@ntu.edu.sg)
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2022
Subjects: Science::Mathematics
Online Access: https://hdl.handle.net/10356/156923
Institution: Nanyang Technological University
Degree: Bachelor of Science in Mathematical Sciences
Citation: Zhou, Z. (2022). Identifying correlated heavy-hitters. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156923