Distributed classification with variable distributions
| Field | Value |
|---|---|
| Main Author | |
| Other Authors | |
| Format | Theses and Dissertations |
| Language | English |
| Published | 2015 |
| Subjects | |
| Online Access | https://hdl.handle.net/10356/62213 |
| Institution | Nanyang Technological University |
Summary: When the data at a location is insufficient, a naive solution is to gather data from other (remote) sites and classify it using a centralized algorithm. Although this approach performs well, it is often infeasible due to high communication overheads and the poor scalability of centralized solutions. These concerns have led to the emergence of distributed classification, which promises to improve the classification accuracy of a learning agent (called a party) on its local data by drawing on the knowledge of the other parties in the distributed network. However, current work implicitly assumes that all parties receive data from exactly the same distribution. We show that this scenario is too simple: in reality, data may differ across parties in the distribution of the inputs (observations), the outputs (labels), or both. We remove this simplifying assumption by allowing parties to draw data from arbitrary distributions, thereby formalizing the new and challenging problem of distributed classification with variable data distributions. We show that this problem is difficult because it does not admit the state-of-the-art solutions for (conventional) distributed classification. After posing the problem and illustrating its difficulty, we present a list of key research challenges (sub-problems) that this field must address, and for each challenge we suggest potential research directions. Finally, as a first attempt at this new problem, we present a simple-to-implement yet effective algorithm called VarDist that efficiently handles the setting where the data distribution varies across the participating parties. Although VarDist is not a complete or sophisticated solution, it has low communication costs while producing a more accurate classifier than local learning by benefiting from the auxiliary classifiers of the other parties.
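
The abstract does not describe VarDist's internals, so the following is only a minimal sketch of the general idea it alludes to: each party trains a classifier on its own data, exchanges trained models rather than raw data (which keeps communication cheap), and weights the auxiliary models by how well they fit its local distribution. The `Party` class, the accuracy-based weighting, and the assumption that all parties share a single label set are illustrative choices, not the thesis's actual algorithm.

```python
# Illustrative sketch only: the weighting scheme and class design below
# are assumptions, not the VarDist algorithm from the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


class Party:
    """One participant: holds local data and a locally trained classifier."""

    def __init__(self, X_train, y_train, X_val, y_val):
        self.X_val, self.y_val = X_val, y_val
        # Only this fitted model (not the data) would be shared with peers.
        self.local_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    def weight_for(self, model):
        # Score a classifier on this party's local validation set; models
        # trained on very different distributions naturally get low weight.
        return accuracy_score(self.y_val, model.predict(self.X_val))

    def predict(self, X, auxiliary_models):
        # Combine the local model with the auxiliary models received from
        # other parties, weighted by their local validation accuracy.
        models = [self.local_model] + list(auxiliary_models)
        weights = np.array([self.weight_for(m) for m in models])
        weights /= weights.sum()
        # Weighted average of class-probability estimates; assumes every
        # model was trained on the same label set, so the columns of
        # predict_proba line up across models.
        probs = sum(w * m.predict_proba(X) for w, m in zip(weights, models))
        return self.local_model.classes_[np.argmax(probs, axis=1)]
```

In this reading, "low costs of communication" comes from exchanging only model parameters, and the accuracy gain over purely local learning comes from auxiliary classifiers that happen to match the party's local distribution receiving higher weight.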