Identifying correlated heavy-hitters
The heavy hitter problem asks to find the top k most frequent elements in a data stream. This problem has been used in many applications across network data analysis, event mining, etc. Many classical algorithms can only handle one-dimensional data such as Count-Sketch and Count-Min. But in this stu...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: |
Nanyang Technological University
2022
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/156923 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-156923 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1569232023-02-28T23:13:08Z Identifying correlated heavy-hitters Zhou, Ziqi Li Yi School of Physical and Mathematical Sciences yi_li@ntu.edu.sg Science::Mathematics The heavy hitter problem asks to find the top k most frequent elements in a data stream. This problem has been used in many applications across network data analysis, event mining, etc. Many classical algorithms can only handle one-dimensional data such as Count-Sketch and Count-Min. But in this study, we focus on heavy hitters in two-dimensional data. Our goal is to identify the location of heavy hitters and estimate their value. In the first part, we use a two-sided Count-Sketch to estimate the value of heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify the location of heavy hitters. A two-sided Count-Sketch means applying Count-Sketch twice. First we apply Count-Sketch on the rows, hashing n rows into Θ(k poly(log n)) different buckets. With a large probability the heavy rows are isolated in different buckets and therefore their l2 norms are preserved. Next we apply Count-Sketch on the columns, allowing us to estimate the heavy entries in each row-bucket. The resulting matrix will have a much smaller dimension than the original matrix. Identification of heavy hitters is built upon Count-Sketch matrices and bit-testing matrices. We further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows for identification of the heavy rows. Bachelor of Science in Mathematical Sciences 2022-04-28T11:22:31Z 2022-04-28T11:22:31Z 2022 Final Year Project (FYP) Zhou, Z. (2022). Identifying correlated heavy-hitters. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156923 https://hdl.handle.net/10356/156923 en application/pdf Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Science::Mathematics |
spellingShingle |
Science::Mathematics Zhou, Ziqi Identifying correlated heavy-hitters |
description |
The heavy hitter problem asks to find the top k most frequent elements in a data stream. This problem has been used in many applications across network data analysis, event mining, etc. Many classical algorithms can only handle one-dimensional data such as Count-Sketch and Count-Min. But in this study, we focus on heavy hitters in two-dimensional data.
Our goal is to identify the location of heavy hitters and estimate their value. In the first part, we use a two-sided Count-Sketch to estimate the value of heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify the location of heavy hitters.
A two-sided Count-Sketch means applying Count-Sketch twice. First we apply Count-Sketch on the rows, hashing n rows into Θ(k poly(log n)) different buckets. With a large probability the heavy rows are isolated in different buckets and therefore their l2 norms are preserved. Next we apply Count-Sketch on the columns, allowing us to estimate the heavy entries in each row-bucket. The resulting matrix will have a much smaller dimension than the original matrix.
Identification of heavy hitters is built upon Count-Sketch matrices and bit-testing matrices. We further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows for identification of the heavy rows. |
author2 |
Li Yi |
author_facet |
Li Yi Zhou, Ziqi |
format |
Final Year Project |
author |
Zhou, Ziqi |
author_sort |
Zhou, Ziqi |
title |
Identifying correlated heavy-hitters |
title_short |
Identifying correlated heavy-hitters |
title_full |
Identifying correlated heavy-hitters |
title_fullStr |
Identifying correlated heavy-hitters |
title_full_unstemmed |
Identifying correlated heavy-hitters |
title_sort |
identifying correlated heavy-hitters |
publisher |
Nanyang Technological University |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/156923 |
_version_ |
1759854349550354432 |