Data-efficient and privacy-enhanced knowledge discovery

Bibliographic Details
Main Author: Shen, Jiyuan
Other Authors: Lam Kwok Yan
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Artificial intelligence; Privacy-enhanced machine learning
Online Access:https://hdl.handle.net/10356/180955
Institution: Nanyang Technological University

Description

Neural networks have developed rapidly over the past decade, and AI applications now empower a wide range of industries. Advancing AI techniques has gradually become a consensus in both the scientific and industrial communities. New large-scale models emerge daily, and this trend underscores a pressing issue: the cost of training a competent neural network has skyrocketed. According to a report by OpenAI, the computation required to reach state-of-the-art deep learning performance doubles every 3.4 months, whereas GPU computational power doubles only every 21.4 months, a significantly slower pace. Consequently, improving deep learning performance merely by consuming more hardware is unsustainable. At the same time, current large-scale models are predominantly data-driven, compelling researchers to amass extensive training data, often by scraping nearly the entire internet for text, audio, image, or video content for unsupervised pre-training. Training a large language model such as GPT-3, for instance, requires processing billions of web pages to achieve human-like text generation. Although this enormous data input enables highly accurate and nuanced outputs, it also increases the risk of embedding and propagating biases and personal information contained in the source material, raising serious privacy and ethical concerns. Given the scope of such data collection, performing knowledge discovery in machine learning both effectively and privately is a profound and unavoidable challenge. This thesis therefore investigates knowledge discovery methods from the perspectives of privacy enhancement and data efficiency and proposes a series of effective techniques.

First, we explore the promising direction of dataset distillation (DD). DD aims to synthesize a compact, informative dataset through a learning procedure, such that neural networks trained on it from scratch show no significant performance drop compared with training on the original large dataset. The distilled data packs denser, richer information into far fewer samples, enabling rapid and efficient model training.

Second, we apply these privacy and efficiency principles in a practical industrial context, namely Internet-of-Things (IoT) networks. IoT devices are often susceptible to cyber attacks because of their open deployment environments and the limited computing capability available for stringent security controls. We help protect IoT sensors from cyber attacks by introducing federated learning with ensemble knowledge distillation (FLEKD), which collaboratively trains a decentralized, shared intrusion detection system (IDS) model without exposing the clients' training data.

Finally, there are scenarios in which a model must efficiently unlearn acquired knowledge. Under regulations such as the GDPR and CCPA, individuals have the right to request the deletion of their data and of any knowledge derived from it in existing models. Training data may also contain malicious or harmful content that is identified only after pre-training, leaving the model with undesirable knowledge.
To address these issues, we introduce Starfish, an efficient framework designed to facilitate swift and effective unlearning.

In conclusion, this thesis explores data-efficient and privacy-enhanced knowledge discovery methods, covering distillation of dataset scale and model size, efficient federated learning mechanisms with knowledge distillation, and rapid, certified unlearning of acquired knowledge. All proposed frameworks are thoroughly validated on various datasets and supported by theoretical proofs. Through this work, we hope to draw more of the research community's attention to the efficiency and security of knowledge discovery, collectively propelling the advancement of next-generation AI technologies.
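
To make the three directions in the description more concrete, the following minimal sketches illustrate the general ideas in PyTorch. They are generic, hypothetical examples, not the specific algorithms developed in the thesis. The first sketch shows a gradient-matching-style approach to dataset distillation, using a toy network and random stand-in data (both assumptions for illustration): a small synthetic set is optimized so that it induces gradients similar to those produced by real data.

```python
# Illustrative gradient-matching dataset distillation sketch.
# Hypothetical setup: toy ConvNet and random stand-in data, not the thesis's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net(num_classes=10):
    # Tiny ConvNet used only to provide gradients for matching.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(16 * 4 * 4, num_classes),
    )

def grad_match_loss(net, real_x, real_y, syn_x, syn_y):
    # Match the network's loss gradients on real vs. synthetic batches.
    params = [p for p in net.parameters() if p.requires_grad]
    g_real = torch.autograd.grad(F.cross_entropy(net(real_x), real_y), params)
    g_syn = torch.autograd.grad(F.cross_entropy(net(syn_x), syn_y), params,
                                create_graph=True)  # keep graph so syn_x can be updated
    return sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
               for a, b in zip(g_real, g_syn))

# Learnable synthetic dataset: 10 images per class for 10 classes.
num_classes, ipc = 10, 10
syn_x = torch.randn(num_classes * ipc, 1, 28, 28, requires_grad=True)
syn_y = torch.arange(num_classes).repeat_interleave(ipc)
opt = torch.optim.SGD([syn_x], lr=0.1)

for step in range(100):
    net = make_net(num_classes)             # fresh random network each step
    real_x = torch.randn(64, 1, 28, 28)     # stand-in for a batch of real data
    real_y = torch.randint(0, num_classes, (64,))
    loss = grad_match_loss(net, real_x, real_y, syn_x, syn_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the real batches would come from the target dataset, and the quality of the distilled set would be judged by training fresh networks on it alone.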
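The second sketch illustrates the federated IDS setting described above: clients train locally on their private traffic data, and the server distils the averaged soft predictions of the client ensemble on a shared public set into the global model, so no raw client data leaves the devices. The helper functions and the public transfer set are assumptions for illustration and do not reproduce the thesis's exact FLEKD protocol.

```python
# Simplified federated ensemble-distillation sketch (generic illustration only).
import copy
import torch
import torch.nn.functional as F

def local_update(model, loader, epochs=1, lr=0.01):
    # Standard local supervised training on a client's private data.
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

def ensemble_distill(global_model, client_models, public_loader, T=3.0, lr=0.01):
    # The server never sees client data; it distils the averaged soft
    # predictions of the client ensemble on a shared public/unlabelled set.
    opt = torch.optim.SGD(global_model.parameters(), lr=lr)
    for x, _ in public_loader:  # labels of the public set, if any, are ignored
        with torch.no_grad():
            teacher = torch.stack([F.softmax(m(x) / T, dim=1)
                                   for m in client_models]).mean(0)
        student = F.log_softmax(global_model(x) / T, dim=1)
        loss = F.kl_div(student, teacher, reduction="batchmean") * T * T
        opt.zero_grad()
        loss.backward()
        opt.step()
    return global_model

# One communication round (hypothetical loaders and model):
# client_models = [local_update(global_model, dl) for dl in client_loaders]
# global_model = ensemble_distill(global_model, client_models, public_loader)
```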
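The third sketch illustrates the unlearning scenario: a generic first-order approximate-unlearning loop that applies gradient ascent on the forget set while anchoring the model on retained data. It conveys only the problem setting; the Starfish framework proposed in the thesis, including its certified guarantees, is not reproduced here.

```python
# Generic approximate-unlearning sketch (illustration only, not Starfish).
import torch
import torch.nn.functional as F

def approximate_unlearn(model, forget_loader, retain_loader,
                        epochs=1, lr=1e-3, forget_weight=1.0):
    # Push the model away from the forget set while staying close to the
    # retained data; zip stops at the shorter loader, which is fine for a sketch.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (fx, fy), (rx, ry) in zip(forget_loader, retain_loader):
            # Descend on the retain batch, ascend on the forget batch.
            loss = (F.cross_entropy(model(rx), ry)
                    - forget_weight * F.cross_entropy(model(fx), fy))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```
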
Additional Record Details
College: College of Computing and Data Science
Research Centre: Strategic Centre for Research in Privacy-Preserving Technologies & Systems (SCRIPTS)
Supervisor Contact: kwokyan.lam@ntu.edu.sg
Degree: Master's degree
Date Issued: 2024 (made available 2024-11-06)
Citation: Shen, J. (2024). Data-efficient and privacy-enhanced knowledge discovery. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/180955
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File Format: application/pdf