Identifying correlated heavy-hitters
The heavy hitter problem asks to find the top k most frequent elements in a data stream. This problem has been used in many applications across network data analysis, event mining, etc. Many classical algorithms can only handle one-dimensional data such as Count-Sketch and Count-Min. But in this stu...
Saved in:
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: |
Nanyang Technological University
2022
|
Subjects: | |
Online Access: | https://hdl.handle.net/10356/156923 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Nanyang Technological University |
Language: | English |
Summary: | The heavy hitter problem asks to find the top k most frequent elements in a data stream. This problem has been used in many applications across network data analysis, event mining, etc. Many classical algorithms can only handle one-dimensional data such as Count-Sketch and Count-Min. But in this study, we focus on heavy hitters in two-dimensional data.
Our goal is to identify the location of heavy hitters and estimate their value. In the first part, we use a two-sided Count-Sketch to estimate the value of heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify the location of heavy hitters.
A two-sided Count-Sketch means applying Count-Sketch twice. First we apply Count-Sketch on the rows, hashing n rows into Θ(k poly(log n)) different buckets. With a large probability the heavy rows are isolated in different buckets and therefore their l2 norms are preserved. Next we apply Count-Sketch on the columns, allowing us to estimate the heavy entries in each row-bucket. The resulting matrix will have a much smaller dimension than the original matrix.
Identification of heavy hitters is built upon Count-Sketch matrices and bit-testing matrices. We further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows for identification of the heavy rows. |
---|