Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling
- URL: http://arxiv.org/abs/2210.12575v1
- Date: Sun, 23 Oct 2022 00:12:18 GMT
- Title: Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling
- Authors: Junyuan Hong, Lingjuan Lyu, Jiayu Zhou, Michael Spranger
- Abstract summary: Traditional outsourcing requires uploading device data to the cloud server.
We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
- Score: 49.87637449243698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As deep learning blooms with growing demand for computation and data
resources, outsourcing model training to a powerful cloud server becomes an
attractive alternative to training at a low-power and cost-effective end
device. Traditional outsourcing requires uploading device data to the cloud
server, which can be infeasible in many real-world applications due to the
often sensitive nature of the collected data and the limited communication
bandwidth. To tackle these challenges, we propose to leverage widely available
open-source data, which is a massive dataset collected from public and
heterogeneous sources (e.g., Internet images). We develop a novel strategy
called Efficient Collaborative Open-source Sampling (ECOS) to construct a
proximal proxy dataset from open-source data for cloud training, in lieu of
client data. ECOS probes open-source data on the cloud server to sense the
distribution of client data via a communication- and computation-efficient
sampling process, which only communicates a few compressed public features and
client scalar responses. Extensive empirical studies show that the proposed
ECOS improves the quality of automated client labeling, model compression, and
label outsourcing when applied in various learning scenarios.
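The sampling loop can be pictured with a short sketch. The code below is a minimal illustration under assumed details (compressed features, k-means clustering of the open-source pool, one scalar vote per centroid from the client, and proportional sampling on the server); it is not the authors' implementation, and helper names such as `client_response` are hypothetical.

```python
# Minimal sketch of an ECOS-style sampling loop (not the authors' implementation).
# Assumptions: open-source features are pre-extracted and compressed on the
# server, each client only returns one scalar vote per transmitted centroid, and
# the server samples open-source points in proportion to those votes.
import numpy as np

rng = np.random.default_rng(0)

# Server side: compressed features of the open-source pool (e.g., after PCA).
open_feats = rng.normal(size=(5000, 32))

# Cluster the pool and keep only the centroids for communication.
def kmeans(x, k, iters=20):
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([x[assign == c].mean(0) if (assign == c).any()
                              else centroids[c] for c in range(k)])
    return centroids, assign

centroids, pool_assign = kmeans(open_feats, k=16)

# Client side: for each received centroid, reply with a single scalar
# (here, the fraction of local samples closest to that centroid).
def client_response(local_feats, centroids):
    near = np.argmin(((local_feats[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    counts = np.bincount(near, minlength=len(centroids)).astype(float)
    return counts / counts.sum()          # one scalar response per centroid

client_feats = rng.normal(loc=0.5, size=(800, 32))   # private, never uploaded
votes = client_response(client_feats, centroids)

# Server side: build the proxy dataset by sampling open-source points
# from each cluster in proportion to the client's votes.
budget = 1000
proxy_idx = []
for c, v in enumerate(votes):
    members = np.where(pool_assign == c)[0]
    take = min(len(members), int(round(v * budget)))
    proxy_idx.extend(rng.choice(members, take, replace=False))
print(f"selected {len(proxy_idx)} proxy samples for cloud training")
```

The point of the sketch is the communication pattern: only a handful of compressed centroids travel to the client, and only one scalar per centroid travels back.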
Related papers
- One-Shot Collaborative Data Distillation [9.428116807615407]
Large machine-learning training datasets can be distilled into small collections of informative synthetic data samples.
These synthetic sets support efficient model learning and reduce the communication cost of data sharing.
A naive way to construct a synthetic set in a distributed environment is to allow each client to perform local data distillation and to merge local distillations at a central server.
We introduce the first collaborative data distillation technique, called CollabDM, which captures the global distribution of the data and requires only a single round of communication between client and server.
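For concreteness, here is a toy sketch of the naive distill-and-merge baseline described above, not CollabDM itself; the mean-matching objective and the helper name `local_distill` are illustrative assumptions.

```python
# Toy sketch of the naive baseline: each client distills its own data into a few
# synthetic points (here by matching the local feature mean), and the server
# merely concatenates the results. CollabDM itself captures the global
# distribution in a single round; this is only the strawman it improves on.
import numpy as np

rng = np.random.default_rng(1)

def local_distill(features, n_synth=10, steps=200, lr=0.1):
    """Fit synthetic points toward the mean of the local features."""
    synth = rng.normal(size=(n_synth, features.shape[1]))
    target = features.mean(axis=0)
    for _ in range(steps):
        grad = synth - target            # gradient of 0.5 * ||s - mean||^2
        synth -= lr * grad
    return synth

clients = [rng.normal(loc=i, size=(500, 16)) for i in range(3)]   # private data
merged = np.concatenate([local_distill(c) for c in clients])      # one upload each
print("merged synthetic set:", merged.shape)
```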
arXiv Detail & Related papers (2024-08-05T06:47:32Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- CollaFuse: Navigating Limited Resources and Privacy in Collaborative Generative AI [5.331052581441263]
CollaFuse is a novel framework inspired by split learning.
It enables shared server training and inference, alleviating client computational burdens.
It has the potential to impact various application areas, such as the design of edge computing solutions, healthcare research, or autonomous driving.
arXiv Detail & Related papers (2024-02-29T12:36:10Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Exploring One-shot Semi-supervised Federated Learning with A Pre-trained Diffusion Model [40.83058938096914]
We propose FedDISC, a Federated Diffusion-Inspired Semi-supervised Co-training method.
We first extract prototypes of the labeled server data and use these prototypes to predict pseudo-labels of the client data.
For each category, we compute the cluster centroids and domain-specific representations to signify the semantic and stylistic information of their distributions.
These representations are sent back to the server, which uses a pre-trained diffusion model to generate synthetic datasets that comply with the client distributions and trains a global model on them.
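A rough sketch of the prototype and pseudo-labeling steps summarized above is given below; the nearest-prototype rule, the array shapes, and the omission of the diffusion step are assumptions for illustration, not the FedDISC implementation.

```python
# Minimal sketch of the prototype step: the server shares class prototypes, a
# client pseudo-labels its unlabeled data with them, and returns one centroid
# per class. The diffusion-based synthesis on the server is omitted here.
import numpy as np

rng = np.random.default_rng(2)

# Server: prototypes = per-class mean features of the labeled server data.
num_classes, dim = 5, 64
server_feats = rng.normal(size=(500, dim))
server_labels = rng.integers(0, num_classes, size=500)
prototypes = np.stack([server_feats[server_labels == c].mean(0)
                       for c in range(num_classes)])

# Client: pseudo-label unlabeled local features by nearest prototype,
# then summarize each class with a centroid (all that is sent back).
client_feats = rng.normal(size=(300, dim))
dists = ((client_feats[:, None] - prototypes[None]) ** 2).sum(-1)
pseudo = dists.argmin(axis=1)
centroids = {c: client_feats[pseudo == c].mean(0)
             for c in range(num_classes) if (pseudo == c).any()}

# Server: these centroids would condition a pre-trained diffusion model to
# generate a synthetic, client-like dataset for global training (not shown).
print({c: v.shape for c, v in centroids.items()})
```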
arXiv Detail & Related papers (2023-05-06T14:22:33Z)
- FedNet2Net: Saving Communication and Computations in Federated Learning with Model Growing [0.0]
Federated learning (FL) is a recently developed area of machine learning.
In this paper, a novel scheme based on the notion of "model growing" is proposed.
The proposed approach is tested extensively on three standard benchmarks and is shown to achieve substantial reduction in communication and client computation.
arXiv Detail & Related papers (2022-07-19T21:54:53Z)
- Scalable Neural Data Server: A Data Recommender for Transfer Learning [70.06289658553675]
Transfer learning is a popular strategy for leveraging additional data to improve the downstream performance.
Neural Data Server (NDS), a search engine that recommends relevant data for a given downstream task, has been previously proposed to address this problem.
NDS uses a mixture of experts trained on data sources to estimate similarity between each source and the downstream task.
Scalable Neural Data Server (SNDS) represents both data sources and downstream tasks by their proximity to intermediary datasets.
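The proximity-based representation summarized above can be sketched roughly as follows; the RBF-on-means similarity and the helper `embed_by_proximity` are illustrative assumptions rather than the actual NDS/SNDS system.

```python
# Rough sketch of the SNDS idea: both data sources and a downstream task are
# embedded as vectors of similarity scores against a fixed set of intermediary
# datasets, and sources are ranked by agreement with the task's vector.
import numpy as np

rng = np.random.default_rng(3)

def embed_by_proximity(feats, intermediaries):
    """One similarity score per intermediary dataset (here: RBF on means)."""
    return np.array([np.exp(-np.linalg.norm(feats.mean(0) - inter.mean(0)) ** 2)
                     for inter in intermediaries])

intermediaries = [rng.normal(loc=i, size=(200, 8)) for i in range(4)]
sources = {f"source_{k}": rng.normal(loc=k % 4, size=(300, 8)) for k in range(6)}
task = rng.normal(loc=1, size=(100, 8))

task_vec = embed_by_proximity(task, intermediaries)
ranking = sorted(sources,
                 key=lambda s: -np.dot(embed_by_proximity(sources[s], intermediaries),
                                       task_vec))
print("recommended sources:", ranking[:3])
```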
arXiv Detail & Related papers (2022-06-19T12:07:32Z)
- Data Selection for Efficient Model Update in Federated Learning [0.07614628596146598]
We propose to reduce the amount of local data that is needed to train a global model.
We do this by splitting the model into a lower part for generic feature extraction and an upper part that is more sensitive to the characteristics of the local data.
Our experiments show that less than 1% of the local data can transfer the characteristics of the client data to the global model.
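The lower/upper split described above can be illustrated with a short, assumed PyTorch sketch; the layer sizes, the frozen lower part, and the single small batch standing in for "less than 1% of the local data" are placeholders, not the paper's setup.

```python
# Illustrative split: a frozen lower part extracts generic features, and only
# the upper part is updated with a small fraction of the local data before the
# update is shared toward the global model.
import torch
import torch.nn as nn

lower = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())  # generic
upper = nn.Sequential(nn.Linear(256, 10))                            # data-specific

for p in lower.parameters():          # lower part stays fixed on the client
    p.requires_grad_(False)

opt = torch.optim.SGD(upper.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# A single small batch stands in for the tiny local subset.
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(upper(lower(x)), y)
    loss.backward()
    opt.step()
print("upper-part loss after a few steps:", float(loss))
```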
arXiv Detail & Related papers (2021-11-05T14:07:06Z)
- Federated Multi-Target Domain Adaptation [99.93375364579484]
Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy.
We consider a more practical scenario where the distributed client data is unlabeled, and a centralized labeled dataset is available on the server.
We propose an effective DualAdapt method to address the new challenges.
arXiv Detail & Related papers (2021-08-17T17:53:05Z)
- Multi-modal AsynDGAN: Learn From Distributed Medical Image Data without Sharing Private Information [55.866673486753115]
We propose an extendable and elastic learning framework to preserve privacy and security.
The proposed framework is named Distributed Asynchronized Discriminator Generative Adversarial Networks (AsynDGAN).
arXiv Detail & Related papers (2020-12-15T20:41:24Z)