Accelerated big data analysis with deep generative models
Main Author:
Other Authors:
Format: Final Year Project
Language: English
Published: Nanyang Technological University, 2020
Subjects:
Online Access: https://hdl.handle.net/10356/140453
Institution: Nanyang Technological University
Summary: Over the past decade, the growth of data has been phenomenal. The amount of data the world accumulates is expected to increase from 4.4 zettabytes to 44 zettabytes (44 trillion gigabytes) by the end of 2020, and with more people gaining access to the internet and smart devices, this rate of growth will continue to skyrocket in the near future. This exponential data growth is creating challenges in data analytics. Traditional tools for data analytics such as Excel, SQL databases, or Hadoop, which process queries by scanning the entire database, take too long to evaluate statistical queries on large datasets. We need much faster techniques. In this project, we propose solutions in which a model learns a target dataset and then generates a small dataset with statistical properties similar to those of the target dataset. We call this small representative dataset a mini dataset. Queries computed on the mini dataset give results almost identical to those obtained by computing on the target dataset; however, because the mini dataset has a much smaller memory footprint, computation times are much shorter. It turns out that Deep Generative Models do exactly what we need. In this work, we use two state-of-the-art Deep Generative Models, Normalizing Flows [1] (the primary focus) and Variational Auto-encoders [2], to demonstrate this. We start by explaining how these models can be used to learn any given data distribution and generate mini datasets resembling it. Then, we show and discuss the experimental results obtained by testing our techniques on real datasets. Finally, we compare the advantages and disadvantages of our proposed techniques with other state-of-the-art techniques in Approximate Query Processing (AQP).
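The workflow the abstract describes can be sketched in a few lines. This is only an illustrative stand-in, not the project's actual method: a per-column Gaussian fit plays the role of the trained deep generative model (a Normalizing Flow or VAE would replace it), and the dataset, sizes, and query are all hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical "target" dataset: 100,000 rows of one numeric column.
target = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# "Training": estimating a mean and standard deviation stands in for
# fitting a deep generative model (Normalizing Flow / VAE) to the data.
mu = statistics.fmean(target)
sigma = statistics.stdev(target)

# "Generation": sample a mini dataset 100x smaller than the target.
mini = [random.gauss(mu, sigma) for _ in range(1_000)]

# The same aggregate query answered on both datasets: the mini dataset
# gives a close approximation at a fraction of the scan cost.
exact = statistics.fmean(target)
approx = statistics.fmean(mini)
print(f"exact mean  = {exact:.2f}")
print(f"approx mean = {approx:.2f}")
```

The point of the sketch is the division of labour: the expensive full-table scan happens once, at model-fitting time, after which every query runs against the small generated sample.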