Accelerated big data analysis with deep generative models


Bibliographic Details
Main Author: Tan, Liang Wei
Other Authors: Gao CONG
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/140453
Institution: Nanyang Technological University
Description
Summary: Over the past decade, the growth of data has been phenomenal. The amount of data the world accumulates is expected to increase from 4.4 zettabytes today to 44 zettabytes (44 trillion gigabytes) by the end of the year, and with more people gaining access to the internet and smart devices, this rate of growth will continue to skyrocket in the near future. This exponential data growth is creating challenges in data analytics. Traditional computing tools that process queries by scanning the entire database, such as Excel, SQL databases, or Hadoop, take too much time to evaluate statistical queries on large datasets. We need techniques that are much faster.

In this project, we propose solutions in which a model learns a target dataset and then generates a small dataset with similar statistical properties to the target. We call this small representative dataset a mini dataset. Queries computed on the mini dataset give results that are almost identical to those obtained by computing on the target dataset; however, because the mini dataset has a much smaller memory footprint, computation times are much shorter. It turns out that Deep Generative Models do exactly what we need. In this work, we use two state-of-the-art Deep Generative Models, Normalizing Flows (our main focus) [1] and Variational Auto-encoders [2], to demonstrate this. We start by explaining how these models can be used to learn any given data distribution and to generate mini datasets resembling the learnt distribution. Then, we show and discuss the experimental results obtained by testing our techniques on real datasets. Finally, we compare the advantages and disadvantages of our proposed techniques with other state-of-the-art techniques in Approximate Query Processing (AQP).
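The core idea above can be illustrated with a minimal sketch. This is not the project's method (which trains Normalizing Flows and Variational Auto-encoders on real datasets); as a stand-in, it fits a simple parametric log-normal model to a synthetic numeric column, samples a mini dataset 1000x smaller, and shows that an aggregate query answered on the mini dataset closely approximates the answer on the full dataset. All data and parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "target dataset": one million rows of a skewed numeric column
# (hypothetical data; the project uses real datasets and deep models).
target = rng.lognormal(mean=3.0, sigma=0.5, size=1_000_000)

# Stand-in "generative model": fit a log-normal by maximum likelihood
# (mean and std of the log-values), then sample a mini dataset that is
# 1000x smaller than the target.
log_t = np.log(target)
mu, sigma = log_t.mean(), log_t.std()
mini = rng.lognormal(mean=mu, sigma=sigma, size=1_000)

# The same aggregate query on both datasets gives nearly identical
# answers, but scanning the mini dataset is ~1000x cheaper.
full_answer = target.mean()
approx_answer = mini.mean()
rel_err = abs(approx_answer - full_answer) / full_answer
print(f"full={full_answer:.3f}  mini={approx_answer:.3f}  rel_err={rel_err:.4f}")
```

A deep generative model plays the same role as the fitted log-normal here, but can capture arbitrary multi-column distributions, which is what makes the approach competitive with other AQP techniques.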