Overcoming Noisy and Irrelevant Data in Federated Learning
- URL: http://arxiv.org/abs/2001.08300v2
- Date: Tue, 23 Jun 2020 02:12:29 GMT
- Title: Overcoming Noisy and Irrelevant Data in Federated Learning
- Authors: Tiffany Tuor, Shiqiang Wang, Bong Jun Ko, Changchang Liu, Kin K. Leung
- Abstract summary: Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices.
We propose a method for selecting relevant data in a distributed manner, using a benchmark model trained on a small task-specific benchmark dataset.
The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients.
- Score: 13.963024590508038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many image and vision applications require a large amount of data for model
training. Collecting all such data at a central location can be challenging due
to data privacy and communication bandwidth restrictions. Federated learning is
an effective way of training a machine learning model in a distributed manner
from local data collected by client devices, which does not require exchanging
the raw data among clients. A challenge is that among the large variety of data
collected at each client, it is likely that only a subset is relevant to a
learning task, while the rest of the data has a negative impact on model training.
Therefore, before starting the learning process, it is important to select the
subset of data that is relevant to the given federated learning task. In this
paper, we propose a method for selecting relevant data in a distributed manner:
a benchmark model, trained on a small task-specific benchmark dataset, is used
to evaluate the relevance of individual data samples at each client, and only
the samples with sufficiently high relevance are selected. Each client then
uses only the selected subset of its data in the federated learning process.
The effectiveness of our proposed approach is evaluated on multiple real-world
image datasets in a simulated system with a large number of clients, showing up
to $25\%$ improvement in model accuracy compared to training with all data.
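To make the selection step concrete, here is a minimal sketch of loss-based relevance filtering, assuming (as one plausible instantiation rather than the authors' exact procedure) that a sample's relevance is scored by the benchmark model's loss on it, with low loss read as high relevance; the linear model, data, and threshold are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def benchmark_loss(w_bench, x, y):
    """Per-sample loss of a toy linear benchmark model (squared error).
    `w_bench` stands in for a model trained centrally on a small,
    task-specific benchmark dataset."""
    return float((x @ w_bench - y) ** 2)

def select_relevant(w_bench, X, y, threshold):
    """Client-side filtering: keep only samples the benchmark model
    fits well, i.e. whose loss is below `threshold` (hypothetical rule)."""
    losses = np.array([benchmark_loss(w_bench, xi, yi) for xi, yi in zip(X, y)])
    keep = losses < threshold
    return X[keep], y[keep]

# Toy demo: a client whose data mixes relevant samples (consistent with the
# benchmark model) and irrelevant samples with unrelated labels.
d = 5
w_bench = rng.normal(size=d)
X_rel = rng.normal(size=(80, d)); y_rel = X_rel @ w_bench + 0.1 * rng.normal(size=80)
X_irr = rng.normal(size=(20, d)); y_irr = rng.normal(size=20)  # unrelated labels
X = np.vstack([X_rel, X_irr]); y = np.concatenate([y_rel, y_irr])

X_sel, y_sel = select_relevant(w_bench, X, y, threshold=0.5)
print(f"kept {len(y_sel)} of {len(y)} samples")  # mostly the relevant ones
# Each client would then run standard federated training (e.g. FedAvg)
# on (X_sel, y_sel) instead of its full local dataset.
```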
Related papers
- Dual-Criterion Model Aggregation in Federated Learning: Balancing Data Quantity and Quality [0.0]
Federated learning (FL) has become one of the key methods for privacy-preserving collaborative learning.
An aggregation algorithm is recognized as one of the most crucial components for ensuring the efficacy and security of the system.
This study proposes a novel dual-criterion weighted aggregation algorithm involving both the quantity and the quality of data at each client node.
arXiv Detail & Related papers (2024-11-12T14:09:16Z)
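A minimal sketch of what dual-criterion weighting could look like, assuming (hypothetically; the paper defines its own criteria) that quantity is the local sample count and quality is proxied by a per-client validation score:

```python
import numpy as np

def dual_criterion_aggregate(updates, n_samples, quality):
    """Weighted average of client model updates.

    updates:   list of flat parameter vectors (np.ndarray), one per client
    n_samples: local dataset sizes (quantity criterion)
    quality:   per-client quality scores in [0, 1], e.g. validation
               accuracy on a held-out set (hypothetical proxy)
    """
    w = np.asarray(n_samples, dtype=float) * np.asarray(quality, dtype=float)
    w /= w.sum()  # normalize so the weights form a convex combination
    return sum(wi * ui for wi, ui in zip(w, updates))

# Toy usage: the large but low-quality third client is down-weighted.
updates = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-2.0, 5.0])]
print(dual_criterion_aggregate(updates, n_samples=[100, 120, 500],
                               quality=[0.9, 0.85, 0.1]))
```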
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
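The entry gives no implementation detail, but similarity-based multimodal selection can be sketched as follows, assuming precomputed image and per-class text embeddings (e.g., from CLIP); the alignment rule below is an illustrative stand-in, not the paper's framework.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between rows of `a` and a single vector `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b)
    return a @ b

def select_by_text_alignment(img_emb, class_text_emb, labels, keep_frac=0.8):
    """Keep samples whose image embedding aligns well with the text
    embedding of their own class; poorly aligned samples are treated
    as noisy or mislabeled (hypothetical rule)."""
    scores = np.array([cosine(img_emb[i:i + 1], class_text_emb[labels[i]])[0]
                       for i in range(len(labels))])
    k = int(keep_frac * len(labels))
    return np.argsort(-scores)[:k]  # indices of the best-aligned samples

# Toy demo with random "embeddings" standing in for CLIP features.
rng = np.random.default_rng(1)
img = rng.normal(size=(10, 512))
txt = rng.normal(size=(3, 512))        # one text embedding per class
lab = rng.integers(0, 3, size=10)
print(select_by_text_alignment(img, txt, lab, keep_frac=0.5))
```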
- FedSampling: A Better Sampling Strategy for Federated Learning [81.85411484302952]
Federated learning (FL) is an important technique for learning models from decentralized data in a privacy-preserving way.
Existing FL methods usually uniformly sample clients for local model learning in each round.
We propose a novel uniform data sampling strategy for federated learning (FedSampling).
arXiv Detail & Related papers (2023-06-25T13:38:51Z)
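One plausible reading of uniform data sampling, sketched under the assumption that each client admits each local sample with probability K/N so the pooled draw approximates K samples taken uniformly from all data (FedSampling itself estimates N in a privacy-preserving way, which is omitted here):

```python
import numpy as np

def sample_uniformly_over_data(client_sizes, budget_k, rng):
    """Each client keeps each of its samples with probability K/N, so the
    pooled draw approximates `budget_k` examples sampled uniformly from
    all data, regardless of how it is split across clients."""
    total_n = sum(client_sizes)
    p = min(1.0, budget_k / total_n)
    return [np.flatnonzero(rng.random(n) < p) for n in client_sizes]

rng = np.random.default_rng(2)
picks = sample_uniformly_over_data(client_sizes=[1000, 50, 300],
                                   budget_k=135, rng=rng)
print([len(ix) for ix in picks])  # roughly proportional to client sizes
```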
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
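A minimal sketch of gradient-norm-based data valuation for a toy linear model; the Synaptic Intelligence variant and the clustering/grouping step are replaced here by a simple drop-lowest-value rule (an assumption for illustration):

```python
import numpy as np

def grad_norm_value(w, x, y):
    """Value of one sample as the norm of the loss gradient it induces;
    for squared-error loss of a linear model, grad = 2 * (w.x - y) * x.
    Small norms suggest the sample is redundant for training."""
    return float(np.linalg.norm(2.0 * (x @ w - y) * x))

def drop_redundant(w, X, y, drop_frac=0.3):
    """Offline selection sketch: discard the lowest-valued fraction."""
    vals = np.array([grad_norm_value(w, xi, yi) for xi, yi in zip(X, y)])
    keep = np.argsort(-vals)[: int((1 - drop_frac) * len(y))]
    return X[keep], y[keep]

rng = np.random.default_rng(3)
w = rng.normal(size=4)
X = rng.normal(size=(50, 4)); y = X @ w + rng.normal(size=50)
X_kept, y_kept = drop_redundant(w, X, y)
print(X_kept.shape)  # (35, 4)
```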
- Data Selection for Efficient Model Update in Federated Learning [0.07614628596146598]
We propose to reduce the amount of local data that is needed to train a global model.
We do this by splitting the model into a lower part for generic feature extraction and an upper part that is more sensitive to the characteristics of the local data.
Our experiments show that less than 1% of the local data can transfer the characteristics of the client data to the global model.
arXiv Detail & Related papers (2021-11-05T14:07:06Z)
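A minimal sketch of the split, assuming a frozen lower feature extractor and an upper linear head refit from a small selected subset; the split point, extractor, and subset choice are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)

def lower_features(X):
    """Frozen generic feature extractor (toy stand-in: a fixed random
    projection followed by ReLU); only the upper part is adapted."""
    W = np.random.default_rng(0).normal(size=(X.shape[1], 8))  # fixed weights
    return np.maximum(X @ W, 0.0)

def update_upper(X_sel, y_sel, lr=0.01, steps=300):
    """Refit the upper linear head using only the small selected subset."""
    F = lower_features(X_sel)
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        w -= lr * 2.0 * F.T @ (F @ w - y_sel) / len(y_sel)  # gradient step
    return w

X = rng.normal(size=(500, 5)); y = rng.normal(size=500)
idx = rng.choice(len(y), size=5, replace=False)  # under 1% of the local data
w_upper = update_upper(X[idx], y[idx])
print(w_upper.shape)  # (8,)
```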
- Federated Multi-Target Domain Adaptation [99.93375364579484]
Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy.
We consider a more practical scenario where the distributed client data is unlabeled, and a centralized labeled dataset is available on the server.
We propose an effective DualAdapt method to address the new challenges.
arXiv Detail & Related papers (2021-08-17T17:53:05Z)
- Decentralized federated learning of deep neural networks on non-iid data [0.6335848702857039]
We tackle the non-IID problem of learning a personalized deep learning model in a decentralized setting.
We propose a method named Performance-Based Neighbor Selection (PENS) where clients with similar data detect each other and cooperate.
PENS achieves higher accuracy than strong baselines.
arXiv Detail & Related papers (2021-07-18T19:05:44Z)
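A minimal sketch of performance-based neighbor selection, assuming each client scores peer models by their loss on its own local data and cooperates with the best performers (the gossip communication of PENS is omitted):

```python
import numpy as np

def select_neighbors(my_X, my_y, peer_models, top_k, loss_fn):
    """Score every peer model on this client's local data and keep the
    `top_k` peers with the lowest loss; peers with similar data tend to
    yield low mutual loss, so they become collaborators."""
    losses = [loss_fn(w, my_X, my_y) for w in peer_models]
    return np.argsort(losses)[:top_k]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(5)
w_true = rng.normal(size=3)
X = rng.normal(size=(40, 3)); y = X @ w_true
peers = [w_true + 0.05 * rng.normal(size=3),   # similar distribution
         w_true + 0.05 * rng.normal(size=3),
         rng.normal(size=3)]                   # dissimilar peer
print(select_neighbors(X, y, peers, top_k=2, loss_fn=mse))  # picks peers 0 and 1
```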
- Exploiting Shared Representations for Personalized Federated Learning [54.65133770989836]
We propose a novel federated learning framework and algorithm for learning a shared data representation across clients and unique local heads for each client.
Our algorithm harnesses the distributed computational power across clients to perform many local-updates with respect to the low-dimensional local parameters for every update of the representation.
This result is of interest beyond federated learning to a broad class of problems in which we aim to learn a shared low-dimensional representation among data distributions.
arXiv Detail & Related papers (2021-02-14T05:36:25Z)
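A minimal sketch of the alternating scheme with a shared linear representation and per-client linear heads; the update rules are simplified stand-ins for the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 10, 3, 100

# Synthetic clients: all share a ground-truth representation B_true but
# each has its own local head, mirroring the paper's setting.
B_true = rng.normal(size=(d, k))
clients = []
for _ in range(4):
    X = rng.normal(size=(n, d))
    clients.append((X, X @ B_true @ rng.normal(size=k)))

def local_round(B, head, X, y, head_steps=20, lr=0.01):
    """Many cheap local head updates against the frozen shared
    representation, then a single gradient for the representation."""
    Z = X @ B
    for _ in range(head_steps):
        head -= lr * 2.0 * Z.T @ (Z @ head - y) / len(y)
    resid = Z @ head - y
    grad_B = 2.0 * X.T @ np.outer(resid, head) / len(y)
    return head, grad_B

B = rng.normal(size=(d, k)) / np.sqrt(d)        # shared representation
heads = [np.zeros(k) for _ in clients]
for rnd in range(10):                           # server: average B-gradients
    grads = []
    for i, (X, y) in enumerate(clients):
        heads[i], gB = local_round(B, heads[i], X, y)
        grads.append(gB)
    B -= 0.005 * np.mean(grads, axis=0)
    loss = np.mean([np.mean(((X @ B) @ h - y) ** 2)
                    for (X, y), h in zip(clients, heads)])
    print(f"round {rnd}: avg loss {loss:.3f}")
```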
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)
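A minimal sketch of the recommendation idea, assuming (hypothetically) that each indexed dataset is represented by a small expert model and ranked by that expert's loss on the client's data:

```python
import numpy as np

def recommend_datasets(client_X, client_y, experts, loss_fn, top_k=2):
    """Rank the indexed source datasets by the loss of their expert
    models on the client's data; low loss suggests useful transfer."""
    scores = {name: loss_fn(w, client_X, client_y) for name, w in experts.items()}
    return sorted(scores, key=scores.get)[:top_k]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(7)
w_client = rng.normal(size=4)
Xc = rng.normal(size=(30, 4)); yc = Xc @ w_client
experts = {"datasetA": w_client + 0.1 * rng.normal(size=4),  # related source
           "datasetB": rng.normal(size=4),                   # unrelated source
           "datasetC": w_client + 0.2 * rng.normal(size=4)}
print(recommend_datasets(Xc, yc, experts, mse))  # likely ['datasetA', 'datasetC']
```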
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.