Efficient and privacy-preserving feature importance-based vertical federated learning

Vertical Federated Learning (VFL) enables multiple data owners, each holding a different subset of features about a largely overlapping set of data samples, to collaboratively train a global model. The quality of data owners’ local features affects the performance of the VFL model, which makes featu...

Full description

Saved in:
Bibliographic Details
Main Authors: Li, Anran, Huang, Jiahui, Jia, Ju, Peng, Hongyi, Zhang, Lan, Tuan, Luu Anh, Yu, Han, Li, Xiang-Yang
Other Authors: College of Computing and Data Science
Format: Article
Language:English
Published: 2024
Subjects:
Online Access:https://hdl.handle.net/10356/179064
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:Vertical Federated Learning (VFL) enables multiple data owners, each holding a different subset of features about a largely overlapping set of data samples, to collaboratively train a global model. The quality of data owners’ local features affects the performance of the VFL model, which makes feature selection vitally important. However, existing feature selection methods for VFL either assume the availability of prior knowledge on the number of noisy features or prior knowledge on the post-training threshold of useful features to be selected, making them unsuitable for practical applications. To bridge this gap, we propose the Federated Stochastic Dual-Gate based Feature Selection (FedSDG-FS) approach. It consists of a Gaussian stochastic dual-gate to efficiently approximate the probability of a feature being selected. FedSDG-FS further designs a local embedding perturbation approach to achieve differential privacy for local training data. To reduce overhead, we propose a feature importance initialization method based on Gini impurity, which can accomplish its goals with only two parameter transmissions between the server and the clients. The enhanced version, FedSDG-FS++, protects the privacy for both the clients’ training data and the server's labels through Partially Homomorphic Encryption (PHE) without relying on a trusted third-party. Theoretically, we analyze the convergence rate, privacy guarantees and security analysis of our methods. Extensive experiments on both synthetic and real-world datasets show that FedSDG-FS and FedSDG-FS++ significantly outperform existing approaches in terms of achieving more accurate selection of high-quality features as well as improving VFL performance in a privacy-preserving manner.