Data-efficient and privacy-enhanced knowledge discovery

Bibliographic Details
Main Author: Shen, Jiyuan
Other Authors: Lam Kwok Yan
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2024
Subjects: Computer and Information Science; Artificial intelligence; Privacy-enhanced machine learning
Online Access:https://hdl.handle.net/10356/180955
Institution: Nanyang Technological University

Description

Neural networks have developed rapidly over the past decade, and AI applications now empower a wide range of industries. Advancing AI techniques has gradually become a consensus in both the scientific and industrial communities. New large-scale models emerge daily, and this trend underscores a pressing issue: the cost of training a competent neural network has skyrocketed. According to a report by OpenAI, the computation required to reach state-of-the-art deep learning performance doubles every 3.4 months, whereas GPU computational power doubles only every 21.4 months, a significantly slower pace. Consequently, improving deep learning performance merely by consuming more hardware is unsustainable. At the same time, current large-scale models are predominantly data-driven, compelling researchers to amass extensive training data, often by scraping nearly the entire internet for text, audio, image, or video content for unsupervised pre-training. Training a large language model such as GPT-3, for instance, requires processing billions of web pages to achieve human-like text generation. Although this enormous data input enables highly accurate and nuanced outputs, it also increases the risk of embedding and propagating biases and personal information contained in the source material, raising serious privacy and ethical concerns. Given the scope of such data collection, performing knowledge discovery in machine learning both effectively and privately is a profound and unavoidable challenge. This thesis therefore investigates knowledge discovery methods from the perspectives of privacy enhancement and data efficiency and proposes a series of effective techniques.

First, we explore the promising direction of dataset distillation (DD). DD aims to synthesize a compact, informative dataset through a learning procedure, such that neural networks trained on it from scratch show no significant performance drop compared with training on the original large dataset. The distilled data packs denser, richer information into far fewer samples, enabling rapid and efficient model training.

Second, we apply these privacy and efficiency principles in a practical industrial context, namely Internet-of-Things (IoT) networks. IoT devices are often susceptible to cyber attacks because of their open deployment environments and the limited computing capability available for stringent security controls. We help protect IoT sensors from cyber attacks by introducing federated learning with ensemble knowledge distillation (FLEKD), which collaboratively trains a decentralized, shared intrusion detection system (IDS) model without exposing the clients' training data.

Finally, there are scenarios in which a model must efficiently unlearn acquired knowledge. Under regulations such as the GDPR and CCPA, individuals have the right to request the deletion of their data and of any knowledge derived from it in existing models. Training data may also contain malicious or harmful content that is identified only after pre-training, leaving the model with undesirable knowledge.
To address these issues, we introduce Starfish, an efficient framework designed to facilitate swift and effective unlearning.

In conclusion, this thesis explores data-efficient and privacy-enhanced knowledge discovery methods, covering distillation of dataset scale and model size, efficient federated learning mechanisms with knowledge distillation, and rapid, certified unlearning of acquired knowledge. All proposed frameworks are thoroughly validated on various datasets and supported by theoretical proofs. Through this work, we hope to draw more of the research community's attention to the efficiency and security of knowledge discovery, collectively propelling the advancement of next-generation AI technologies.
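
To make the three directions in the description more concrete, the following minimal sketches illustrate the general ideas in PyTorch. They are generic, hypothetical examples, not the specific algorithms developed in the thesis. The first sketch shows a gradient-matching-style approach to dataset distillation, using a toy network and random stand-in data (both assumptions for illustration): a small synthetic set is optimized so that it induces gradients similar to those produced by real data.

```python
# Illustrative gradient-matching dataset distillation sketch.
# Hypothetical setup: toy ConvNet and random stand-in data, not the thesis's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net(num_classes=10):
    # Tiny ConvNet used only to provide gradients for matching.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(16 * 4 * 4, num_classes),
    )

def grad_match_loss(net, real_x, real_y, syn_x, syn_y):
    # Match the network's loss gradients on real vs. synthetic batches.
    params = [p for p in net.parameters() if p.requires_grad]
    g_real = torch.autograd.grad(F.cross_entropy(net(real_x), real_y), params)
    g_syn = torch.autograd.grad(F.cross_entropy(net(syn_x), syn_y), params,
                                create_graph=True)  # keep graph so syn_x can be updated
    return sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
               for a, b in zip(g_real, g_syn))

# Learnable synthetic dataset: 10 images per class for 10 classes.
num_classes, ipc = 10, 10
syn_x = torch.randn(num_classes * ipc, 1, 28, 28, requires_grad=True)
syn_y = torch.arange(num_classes).repeat_interleave(ipc)
opt = torch.optim.SGD([syn_x], lr=0.1)

for step in range(100):
    net = make_net(num_classes)             # fresh random network each step
    real_x = torch.randn(64, 1, 28, 28)     # stand-in for a batch of real data
    real_y = torch.randint(0, num_classes, (64,))
    loss = grad_match_loss(net, real_x, real_y, syn_x, syn_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the real batches would come from the target dataset, and the quality of the distilled set would be judged by training fresh networks on it alone.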
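The second sketch illustrates the federated IDS setting described above: clients train locally on their private traffic data, and the server distils the averaged soft predictions of the client ensemble on a shared public set into the global model, so no raw client data leaves the devices. The helper functions and the public transfer set are assumptions for illustration and do not reproduce the thesis's exact FLEKD protocol.

```python
# Simplified federated ensemble-distillation sketch (generic illustration only).
import copy
import torch
import torch.nn.functional as F

def local_update(model, loader, epochs=1, lr=0.01):
    # Standard local supervised training on a client's private data.
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model

def ensemble_distill(global_model, client_models, public_loader, T=3.0, lr=0.01):
    # The server never sees client data; it distils the averaged soft
    # predictions of the client ensemble on a shared public/unlabelled set.
    opt = torch.optim.SGD(global_model.parameters(), lr=lr)
    for x, _ in public_loader:  # labels of the public set, if any, are ignored
        with torch.no_grad():
            teacher = torch.stack([F.softmax(m(x) / T, dim=1)
                                   for m in client_models]).mean(0)
        student = F.log_softmax(global_model(x) / T, dim=1)
        loss = F.kl_div(student, teacher, reduction="batchmean") * T * T
        opt.zero_grad()
        loss.backward()
        opt.step()
    return global_model

# One communication round (hypothetical loaders and model):
# client_models = [local_update(global_model, dl) for dl in client_loaders]
# global_model = ensemble_distill(global_model, client_models, public_loader)
```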
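The third sketch illustrates the unlearning scenario: a generic first-order approximate-unlearning loop that applies gradient ascent on the forget set while anchoring the model on retained data. It conveys only the problem setting; the Starfish framework proposed in the thesis, including its certified guarantees, is not reproduced here.

```python
# Generic approximate-unlearning sketch (illustration only, not Starfish).
import torch
import torch.nn.functional as F

def approximate_unlearn(model, forget_loader, retain_loader,
                        epochs=1, lr=1e-3, forget_weight=1.0):
    # Push the model away from the forget set while staying close to the
    # retained data; zip stops at the shorter loader, which is fine for a sketch.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (fx, fy), (rx, ry) in zip(forget_loader, retain_loader):
            # Descend on the retain batch, ascend on the forget batch.
            loss = (F.cross_entropy(model(rx), ry)
                    - forget_weight * F.cross_entropy(model(fx), fy))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```
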
Additional Record Details
College: College of Computing and Data Science
Research Centre: Strategic Centre for Research in Privacy-Preserving Technologies & Systems (SCRIPTS)
Supervisor Contact: kwokyan.lam@ntu.edu.sg
Degree: Master's degree
Date Issued: 2024 (made available 2024-11-06)
Citation: Shen, J. (2024). Data-efficient and privacy-enhanced knowledge discovery. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/180955
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
File Format: application/pdf