Identifying correlated heavy-hitters

The heavy hitter problem asks for the top k most frequent elements in a data stream. It arises in many applications, including network data analysis and event mining. Many classical algorithms, such as Count-Sketch and Count-Min, can only handle one-dimensional data; in this study we focus on heavy hitters in two-dimensional data. Our goal is to identify the locations of the heavy hitters and to estimate their values. In the first part, we use a two-sided Count-Sketch to estimate the values of the heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify their locations. A two-sided Count-Sketch applies Count-Sketch twice. First, we apply Count-Sketch to the rows, hashing the n rows into Θ(k poly(log n)) buckets; with high probability the heavy rows land in distinct buckets, so their l2 norms are preserved. Next, we apply Count-Sketch to the columns, which lets us estimate the heavy entries within each row bucket. The resulting matrix has a much smaller dimension than the original. Identification of heavy hitters is built on Count-Sketch matrices and bit-testing matrices, and we further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows, which identifies the heavy rows.
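
For illustration, the following is a minimal Python sketch of the standard one-dimensional Count-Sketch, the building block that the two-sided construction above applies twice (first to the rows, then to the columns). The repetition count, bucket count, and hash functions are illustrative assumptions, not the parameters used in the project.

    # Minimal one-dimensional Count-Sketch (illustrative parameters only).
    import random
    from collections import defaultdict

    class CountSketch:
        def __init__(self, repetitions=5, buckets=256, seed=0):
            rng = random.Random(seed)
            # Each repetition gets its own bucket-hash seed and sign-hash seed.
            self.seeds = [(rng.random(), rng.random()) for _ in range(repetitions)]
            self.buckets = buckets
            self.tables = [defaultdict(float) for _ in range(repetitions)]

        def _bucket(self, seed, item):
            return hash((seed, item)) % self.buckets

        def _sign(self, seed, item):
            return 1 if hash((seed, item, "sign")) % 2 == 0 else -1

        def update(self, item, weight=1.0):
            for table, (hs, ss) in zip(self.tables, self.seeds):
                table[self._bucket(hs, item)] += self._sign(ss, item) * weight

        def estimate(self, item):
            # Median of the per-repetition unbiased estimates.
            ests = sorted(self._sign(ss, item) * table[self._bucket(hs, item)]
                          for table, (hs, ss) in zip(self.tables, self.seeds))
            return ests[len(ests) // 2]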

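Likewise, the bit-testing idea behind the identification step can be shown with a toy example: next to a total counter, keep one counter per bit position of the item index, and read off each bit of a dominant item by majority. This sketch ignores the Count-Sketch bucketing, the Johnson-Lindenstrauss norm estimation, and the error-correcting code that the project layers on top; the function name and parameters are hypothetical.

    # Toy bit-testing: recover the index of an item that dominates the stream.
    def identify_dominant_index(stream, index_bits=16):
        total = 0.0
        bit_counters = [0.0] * index_bits
        for index, weight in stream:          # stream of (index, weight) updates
            total += weight
            for b in range(index_bits):
                if (index >> b) & 1:
                    bit_counters[b] += weight
        # Bit b of the dominant index is 1 iff more than half of the total
        # mass carries that bit.
        return sum((1 << b) for b in range(index_bits) if bit_counters[b] > total / 2)

    # Example: index 42 dominates, so it is recovered despite the noise items.
    stream = [(42, 100.0), (7, 3.0), (13, 2.0)]
    print(identify_dominant_index(stream))    # prints 42
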
Bibliographic Details
Main Author: Zhou, Ziqi
Other Authors: Li Yi (School of Physical and Mathematical Sciences, yi_li@ntu.edu.sg)
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2022
Subjects: Science::Mathematics
Online Access: https://hdl.handle.net/10356/156923
Institution: Nanyang Technological University
Degree: Bachelor of Science in Mathematical Sciences
Citation: Zhou, Z. (2022). Identifying correlated heavy-hitters. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156923