Data-efficient and privacy-enhanced knowledge discovery
Neural networks have developed rapidly over the past decade, and applications of AI now empower a wide range of industrial chains. Advancing AI techniques has gradually become a consensus goal in both the scientific and industrial communities. New large-scale models emerge daily, and con...
Main Author: | Shen, Jiyuan |
---|---|
Other Authors: | Lam Kwok Yan |
Format: | Thesis-Master by Research |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science; Artificial intelligence; Privacy-enhanced machine learning |
Online Access: | https://hdl.handle.net/10356/180955 |
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-180955 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Computer and Information Science Artificial intelligence Privacy-enhanced machine learning |
description |
Neural networks have developed rapidly over the past decade, and applications of AI now empower a wide range of industrial chains. Advancing AI techniques has gradually become a consensus goal in both the scientific and industrial communities. New large-scale models emerge daily, and this trend underscores a pressing issue: the cost of training a competent neural network model has skyrocketed along with it. According to the latest report by OpenAI, the computational cost of achieving state-of-the-art performance in deep learning doubles every 3.4 months, whereas GPU computational power doubles only every 21.4 months, a significantly slower pace. Consequently, improving deep learning performance merely by consuming more hardware is unsustainable. At the same time, current large-scale models are predominantly data-driven, which compels researchers to amass extensive training data. This often involves scraping nearly the entire internet for text, audio, image or video data to conduct unsupervised pre-training of large models. Training a large language model such as GPT-3, for instance, requires processing billions of web pages to achieve human-like text generation capabilities. Although this enormous data input enables the model to produce highly accurate and nuanced outputs, it also increases the risk of embedding and spreading the biases and personal information contained in the source material, raising serious privacy-leakage and ethical concerns. Given the scope of such data collection, performing knowledge discovery in machine learning both effectively and privately is a profound and unavoidable challenge.
To address this challenge, the thesis investigates advanced knowledge discovery methods from the perspectives of privacy enhancement and data efficiency, and proposes a series of effective techniques and approaches. First, we explore the promising direction of dataset distillation (DD). DD aims to learn a compact, informative synthetic dataset on which neural networks can be trained from scratch without a significant performance drop compared with training on the original large dataset. The distilled datasets carry denser and richer information, enabling rapid and efficient model training.
Second, we apply the discovered knowledge, under privacy and efficiency principles, in a practical industrial context, namely Internet-of-Things (IoT) networks. IoT devices are often susceptible to cyber attacks because of their open deployment environment and their limited computing capacity for stringent security controls. We help protect IoT sensors from cyber attacks by introducing federated learning with ensemble knowledge distillation (FLEKD), which collaboratively trains a shared intrusion detection system (IDS) model in a decentralized manner without exposing the clients' training data.
Finally, there are scenarios in which a model needs to efficiently unlearn acquired knowledge. For instance, under regulations such as GDPR and CCPA, individuals have the right to request the deletion of their data, and of any knowledge derived from it, from existing models. In addition, the training data may contain malicious content or harmful information that is identified only after pretraining, by which point the main model has already absorbed this undesirable knowledge. To address these issues, we introduce an efficient framework, named Starfish, designed to facilitate swift and effective unlearning.
In conclusion, this thesis explores data-efficient and privacy-enhanced knowledge discovery methods, covering distillation of dataset scale or model size, efficient federated learning mechanisms with knowledge distillation, and rapid, certified unlearning of acquired knowledge. All proposed frameworks have been thoroughly validated across various datasets and are supported by theoretical proofs. Through this work, we hope the research community will pay greater attention to enhancing the efficiency and security of knowledge discovery, collectively propelling the advancement of next-generation AI technologies. |
author2 |
Lam Kwok Yan |
author_facet |
Lam Kwok Yan Shen, Jiyuan |
format |
Thesis-Master by Research |
author |
Shen, Jiyuan |
author_sort |
Shen, Jiyuan |
title |
Data-efficient and privacy-enhanced knowledge discovery |
publisher |
Nanyang Technological University |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/180955 |
_version_ |
1819112931478994944 |
spelling |
sg-ntu-dr.10356-180955 2024-12-03T05:20:50Z Data-efficient and privacy-enhanced knowledge discovery Shen, Jiyuan Lam Kwok Yan College of Computing and Data Science Strategic Centre for Research in Privacy-Preserving Technologies & Systems (SCRIPTS) kwokyan.lam@ntu.edu.sg Computer and Information Science Artificial intelligence Privacy-enhanced machine learning Master's degree 2024-11-06T00:45:25Z 2024-11-06T00:45:25Z 2024 Thesis-Master by Research Shen, J. (2024). Data-efficient and privacy-enhanced knowledge discovery. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/180955 https://hdl.handle.net/10356/180955 10.32657/10356/180955 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |