Data provenance via differential auditing

With the rising awareness of data assets, data governance, that is, understanding where data comes from, how it is collected, and how it is used, has become increasingly important. One critical component of data governance that is gaining attention is auditing machine learning models to de...

Bibliographic Details
Main Authors: MU, Xin, PANG, Ming, ZHU, Feida
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
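Subjects: Data models; training data; biological system modeling; computational modeling; predictive models; machine learning; Databases and Information Systems; Data Storage Systems; Numerical Analysis and Scientific Computing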
Online Access: https://ink.library.smu.edu.sg/sis_research/7808
https://ink.library.smu.edu.sg/context/sis_research/article/8811/viewcontent/2209.01538.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-8811
record_format dspace
spelling sg-smu-ink.sis_research-8811 2023-12-12T05:35:37Z Data provenance via differential auditing MU, Xin PANG, Ming ZHU, Feida
2023-11-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/7808 info:doi/10.1109/TKDE.2023.3334821 https://ink.library.smu.edu.sg/context/sis_research/article/8811/viewcontent/2209.01538.pdf http://creativecommons.org/licenses/by/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Data models training data biological system modeling computational modeling predictive models machine learning Databases and Information Systems Data Storage Systems Numerical Analysis and Scientific Computing
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Data models
training data
biological system modeling
computational modeling
predictive models
machine learning
Databases and Information Systems
Data Storage Systems
Numerical Analysis and Scientific Computing
description With the rising awareness of data assets, data governance, that is, understanding where data comes from, how it is collected, and how it is used, has become increasingly important. One critical component of data governance that is gaining attention is auditing machine learning models to determine whether specific data has been used for training. Existing auditing techniques, such as shadow auditing methods, have been shown to be feasible under specific conditions, such as access to label information and knowledge of training protocols. However, these conditions are often not met in real-world applications. In this paper, we introduce a practical framework for auditing data provenance based on a differential mechanism: after a carefully designed transformation, perturbed input data from the target model's training set results in much more drastic changes in the output than perturbed data from outside the training set. Our framework is data-dependent and does not require distinguishing training data from non-training data or training additional shadow models with labeled output data. Furthermore, our framework extends beyond point-based data auditing to group-based data auditing, aligning with the needs of real-world applications. Our theoretical analysis of the differential mechanism and the experimental results on real-world data sets verify the proposal's effectiveness. The code has been uploaded via an anonymous link.
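The differential idea in the description can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the transformation, so plain Gaussian noise is assumed here, and the "target model" is a hypothetical scorer that memorizes its training points. The names `make_toy_model`, `differential_score`, and `audit_group` are illustrative only.

```python
import math
import random


def make_toy_model(train_points, bandwidth=0.1):
    """Return a toy scorer that 'memorizes' its training points.

    Hypothetical stand-in for an overfit target model: the output is 1
    at a training point and decays quickly with distance from it.
    """
    def model(x):
        return max(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, t)) / (2 * bandwidth ** 2))
            for t in train_points
        )
    return model


def differential_score(model, x, noise=0.2, trials=50, seed=0):
    """Mean absolute change in the model output when x is perturbed.

    Assumption: Gaussian noise plays the role of the transformation.
    """
    rng = random.Random(seed)
    base = model(x)
    total = 0.0
    for _ in range(trials):
        perturbed = [v + rng.gauss(0.0, noise) for v in x]
        total += abs(model(perturbed) - base)
    return total / trials


def audit_group(model, points, **kwargs):
    """Group-level score: the average of the point-level scores."""
    return sum(differential_score(model, p, **kwargs) for p in points) / len(points)


# Toy data: three points the model was "trained" on, three it never saw.
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
unseen = [[5.0, 5.0], [6.0, 4.0], [4.0, 6.5]]
model = make_toy_model(train)

member_score = audit_group(model, train)
non_member_score = audit_group(model, unseen)
print("training group score:", member_score)
print("non-training group score:", non_member_score)
```

In this toy setting the training group receives the larger score because the scorer's output collapses as soon as a memorized point is moved, while the output for unseen points is already near zero and barely changes. A real audit would need a transformation and a decision threshold suited to the target model, neither of which is given in this record.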
format text
author MU, Xin
PANG, Ming
ZHU, Feida
title Data provenance via differential auditing
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/7808
https://ink.library.smu.edu.sg/context/sis_research/article/8811/viewcontent/2209.01538.pdf
_version_ 1787136835057614848