Identifying correlated heavy-hitters

Bibliographic Details
Main Author:	Zhou, Ziqi
Other Authors:	Li Yi
Format:	Final Year Project
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Science::Mathematics
Online Access:	https://hdl.handle.net/10356/156923
Institution:	Nanyang Technological University

Description
Abstract:	The heavy hitter problem asks for the top k most frequent elements in a data stream. It arises in many applications, including network data analysis and event mining. Many classical algorithms, such as Count-Sketch and Count-Min, can only handle one-dimensional data. In this study, we focus on heavy hitters in two-dimensional data. Our goal is to identify the locations of the heavy hitters and estimate their values. In the first part, we use a two-sided Count-Sketch to estimate the values of the heavy hitters. In the second part, we use error-correcting codes and hashing matrices to identify their locations. A two-sided Count-Sketch applies Count-Sketch twice. First, we apply Count-Sketch to the rows, hashing the n rows into Θ(k poly(log n)) buckets. With high probability the heavy rows are isolated in different buckets, so their l2 norms are preserved. Next, we apply Count-Sketch to the columns, which allows us to estimate the heavy entries in each row bucket. The resulting matrix has a much smaller dimension than the original matrix. Identification of heavy hitters is built upon Count-Sketch matrices and bit-testing matrices. We further incorporate error-correcting codes to reduce the failure probability. We also use a Johnson-Lindenstrauss matrix to estimate the l2 norms of the rows in order to identify the heavy rows.
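
The two-sided Count-Sketch described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names, the bucket counts, the number of repetitions, and the use of a median over independent repetitions are all assumptions chosen for illustration. Each repetition hashes rows and columns into buckets with random signs, and an entry is estimated by undoing the signs on the bucket it landed in.

```python
import random
from statistics import median


def make_hash(size, buckets, rng):
    """Random bucket and random +/-1 sign for each index 0..size-1."""
    bucket = [rng.randrange(buckets) for _ in range(size)]
    sign = [rng.choice((-1, 1)) for _ in range(size)]
    return bucket, sign


def two_sided_sketch(matrix, row_buckets, col_buckets, reps=5, seed=0):
    """Build `reps` independent sketches of size row_buckets x col_buckets.

    All parameters are illustrative assumptions, not values from the thesis.
    """
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    sketches = []
    for _ in range(reps):
        rb, rs = make_hash(n, row_buckets, rng)   # Count-Sketch on rows
        cb, cs = make_hash(m, col_buckets, rng)   # Count-Sketch on columns
        table = [[0.0] * col_buckets for _ in range(row_buckets)]
        for i, row in enumerate(matrix):
            for j, value in enumerate(row):
                if value:
                    table[rb[i]][cb[j]] += rs[i] * cs[j] * value
        sketches.append((rb, rs, cb, cs, table))
    return sketches


def estimate(sketches, i, j):
    """Median over repetitions of the sign-corrected bucket value for (i, j)."""
    return median(
        rs[i] * cs[j] * table[rb[i]][cb[j]]
        for rb, rs, cb, cs, table in sketches
    )


if __name__ == "__main__":
    rng = random.Random(42)
    # 64 x 64 matrix of small noise with one planted heavy entry.
    data = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(64)]
    data[3][5] = 100.0
    sk = two_sided_sketch(data, row_buckets=16, col_buckets=16)
    print("estimate of entry (3, 5):", estimate(sk, 3, 5))
```

Each stored table has 16 x 16 cells instead of 64 x 64, which mirrors the abstract's point that the sketched matrix is much smaller than the original. The sketch only covers value estimation; locating heavy entries without knowing their indices is the part the abstract assigns to bit-testing matrices and error-correcting codes.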
