Privacy-preserving data analytics

Bibliographic Details
Main Author: Zhao, Yang
Other Authors: Jun Zhao
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects:
Online Access:https://hdl.handle.net/10356/160032
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-160032
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Zhao, Yang
Privacy-preserving data analytics
description Massive volumes of sensitive information are being collected for data analytics and machine learning, such as large-scale Internet of Things (IoT) data. Some IoT data contain users’ confidential information, for example, energy consumption or location data. These data may expose a family’s habits and routines, which attackers may exploit. The Internet of Vehicles (IoV), a promising branch of IoT, has stimulated a large variety of crowdsourcing applications such as Waze, Uber, and Amazon Mechanical Turk. These applications report real-time traffic information to a cloud server, which trains a machine learning model on the traffic information uploaded by intelligent traffic management users. However, crowdsourcing application owners can easily infer users’ location, traffic, motor vehicle, and environmental information, raising severe privacy concerns over sensitive personal information. Moreover, as the number of vehicles increases, the frequent communication between vehicles and the cloud server incurs a tremendous communication cost. Many countries have strict policies, regulations, and laws governing how technology companies collect and process users’ data to protect personal privacy, yet these companies need to analyze users’ data to improve their service quality. To preserve privacy while revealing useful information about datasets, differential privacy (DP) has been proposed. Intuitively, the output of a DP mechanism does not change significantly with the presence or absence of a single tuple in the dataset. DP has attracted much interest from both academia and industry: for example, Apple has incorporated DP into its mobile operating system iOS, and Google has implemented a DP tool called RAPPOR in the Chrome browser to collect information. 
DP has also been widely studied in the literature to protect the privacy of users’ information. The privacy parameters bound the information about the dataset leaked by the noisy output. Oftentimes, a dataset is used to answer multiple queries, so the level of privacy protection may degrade as more queries are answered; it is therefore crucial to track privacy-budget spending, which must not exceed the given privacy budget. This thesis makes three major contributions. The first contribution integrates federated learning (FL) and local differential privacy (LDP) to enable crowdsourcing applications to obtain a machine learning model while avoiding privacy leakage and reducing communication cost. Specifically, we propose four LDP mechanisms to perturb gradients. The proposed Three-Outputs mechanism introduces three output possibilities to deliver high accuracy when the privacy budget is small; its outputs can be encoded with two bits to reduce the communication cost. Additionally, to maximize performance when the privacy budget is large, an optimal piecewise mechanism (PM-OPT) is proposed, and we further propose a suboptimal piecewise mechanism (PM-SUB) with a simpler formula and utility comparable to PM-OPT. We then build a novel hybrid mechanism by combining the Three-Outputs and PM-SUB mechanisms. Finally, an LDP-based FL stochastic gradient descent algorithm (LDP-FedSGD) is proposed to coordinate the cloud server and edge devices in training the machine learning model collaboratively. Applying the proposed LDP algorithms to FL protects private personal information against adversaries who reverse-engineer uploaded gradients, while preserving the utility of the gradients for FL. 
The second contribution observes that when a query has been answered before and is asked again on the same dataset, the previous noisy response may be reused to answer the current query, saving privacy cost. In view of this, we design an algorithm that reuses previous noisy responses when the same query is asked repeatedly. In particular, since different requests of the same query may carry different DP requirements, our algorithm determines the optimal fraction of the old noisy response to reuse and the amount of new noise to add, minimizing the accumulated privacy cost. To implement the algorithm, we design and build a blockchain-based system for tracking and saving DP costs: the blockchain provides a distributed immutable ledger that records each query’s type, the noisy response used to answer it, the noise level added to the true query result, and the remaining privacy budget. As a result, the dataset owner knows how the dataset has been used and can be confident that no new privacy cost will be incurred for answering queries once the specified privacy budget is exhausted. The third contribution designs an FL system that leverages a reputation mechanism to assist home appliance manufacturers in training a machine learning model on customers’ data, helping manufacturers develop a smart home system so that they can predict customers’ future requirements and consumption behaviors. The workflow of the system has two stages. In the first stage, customers train the initial model provided by the manufacturer using both their mobile phones and a mobile edge computing (MEC) server: they collect data from various home appliances with their phones, download and train the initial model on their local data, and, after deriving local models, sign their models and send them to the blockchain. 
To guard against malicious customers or manufacturers, we use the blockchain to replace the centralized aggregator of the traditional FL system; since records on the blockchain are tamper-proof, malicious customers’ or manufacturers’ activities are traceable. In the second stage, manufacturers select customers or organizations as miners to compute the averaged model from the models received from customers. At the end of the crowdsourcing task, one of the miners, chosen as the temporary leader, uploads the model to the blockchain. We enforce DP on the extracted features and propose a new normalization technique to protect customers’ privacy and improve test accuracy, and we experimentally demonstrate that our normalization technique outperforms batch normalization when features are under DP protection. In addition, to attract more customers to participate in the crowdsourcing FL task, we design an incentive mechanism to reward participants. In summary, this thesis addresses challenging problems in privacy-preserving analysis of data from IoT devices, including designing algorithms to preserve data privacy, managing the differential privacy cost wisely with blockchain, and proposing a normalization technique to improve the accuracy of the FL model. We conduct extensive experiments on publicly available real datasets to validate the proposed algorithms and systems. Finally, we list several promising research directions for future work.
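The DP intuition in the abstract (the output barely changes with one tuple's presence or absence) is commonly realized by the classic Laplace mechanism. A minimal generic sketch, not code from the thesis:

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return an epsilon-DP noisy answer: add Laplace(0, sensitivity/epsilon)
    noise, sampled via the inverse-CDF transform."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5               # u in (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_answer + noise
```

A larger epsilon means less noise and weaker privacy; for a counting query the sensitivity is 1, since one tuple changes the count by at most 1.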
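The gradient-perturbation idea behind the first contribution can be illustrated with the classic two-output LDP mechanism of Duchi et al., which the Three-Outputs mechanism described above extends with a third output value; the probabilities below are the standard generic ones, not the thesis's formulas:

```python
import math
import random

def duchi_perturb(x, epsilon):
    """Epsilon-LDP perturbation for a gradient component x in [-1, 1].
    Outputs +/-C with C = (e^eps + 1)/(e^eps - 1); unbiased: E[output] = x."""
    e = math.exp(epsilon)
    c = (e + 1) / (e - 1)
    p = 0.5 + x * (e - 1) / (2 * (e + 1))   # probability of outputting +c
    return c if random.random() < p else -c
```

Because each output is one of a few discrete values, it can be encoded in very few bits, which is the same communication-saving property the abstract claims for Three-Outputs.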
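The noisy-response reuse of the second contribution can be sketched roughly as follows. This is a deliberately simplified illustration (reusing an old answer is free by the post-processing property when it is already accurate enough), not the thesis's optimal-fraction algorithm or its blockchain ledger, and all names are hypothetical:

```python
import math
import random

def laplace_noise(scale):
    # Laplace(0, scale) via inverse-CDF sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

class BudgetLedger:
    """Toy privacy-budget ledger: repeated queries reuse the stored noisy
    answer at no extra privacy cost; fresh answers are charged to the budget
    and refused once the budget is exhausted."""
    def __init__(self, total_budget):
        self.remaining = total_budget
        self.answers = {}                   # query -> (noisy_answer, eps_spent)

    def ask(self, query, true_answer, sensitivity, epsilon):
        if query in self.answers:
            old_answer, old_eps = self.answers[query]
            if epsilon <= old_eps:          # old answer is accurate enough
                return old_answer           # reuse: no new privacy cost
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        noisy = true_answer + laplace_noise(sensitivity / epsilon)
        self.answers[query] = (noisy, epsilon)
        return noisy
```

In the thesis's system the per-query records and remaining budget are stored on a blockchain rather than in an in-memory dictionary, so the dataset owner can audit exactly how the budget was spent.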
author2 Jun Zhao
author_facet Jun Zhao
Zhao, Yang
format Thesis-Doctor of Philosophy
author Zhao, Yang
author_sort Zhao, Yang
title Privacy-preserving data analytics
title_short Privacy-preserving data analytics
title_full Privacy-preserving data analytics
title_fullStr Privacy-preserving data analytics
title_full_unstemmed Privacy-preserving data analytics
title_sort privacy-preserving data analytics
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/160032
_version_ 1743119513004539904
spelling sg-ntu-dr.10356-160032 2022-08-01T05:07:18Z Privacy-preserving data analytics Zhao, Yang Jun Zhao School of Computer Science and Engineering junzhao@ntu.edu.sg Engineering::Computer science and engineering Doctor of Philosophy 2022-07-12T02:42:30Z 2022-07-12T02:42:30Z 2022 Thesis-Doctor of Philosophy Zhao, Y. (2022). Privacy-preserving data analytics. Doctoral thesis, Nanyang Technological University, Singapore. 
https://hdl.handle.net/10356/160032 https://hdl.handle.net/10356/160032 10.32657/10356/160032 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University