Scalable Neural Data Server: A Data Recommender for Transfer Learning
- URL: http://arxiv.org/abs/2206.09386v1
- Date: Sun, 19 Jun 2022 12:07:32 GMT
- Title: Scalable Neural Data Server: A Data Recommender for Transfer Learning
- Authors: Tianshi Cao, Sasha Doubov, David Acuna, Sanja Fidler
- Abstract summary: Transfer learning is a popular strategy for leveraging additional data to improve downstream performance.
Neural Data Server (NDS), a search engine that recommends relevant data for a given downstream task, has been previously proposed to address this problem.
NDS uses a mixture of experts trained on data sources to estimate the similarity between each source and the downstream task.
Scalable Neural Data Server (SNDS) represents both data sources and downstream tasks by their proximity to a fixed set of intermediary datasets.
- Score: 70.06289658553675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Absence of large-scale labeled data in the practitioner's target domain can
be a bottleneck to applying machine learning algorithms in practice. Transfer
learning is a popular strategy for leveraging additional data to improve the
downstream performance, but finding the most relevant data to transfer from can
be challenging. Neural Data Server (NDS), a search engine that recommends
relevant data for a given downstream task, has been previously proposed to
address this problem. NDS uses a mixture of experts trained on data sources to
estimate similarity between each source and the downstream task. Thus, the
computational cost to each user grows with the number of sources. To address
this issue, we propose Scalable Neural Data Server (SNDS), a large-scale
search engine that can theoretically index thousands of datasets to serve
relevant ML data to end users. SNDS trains the mixture of experts on
intermediary datasets during initialization, and represents both data sources
and downstream tasks by their proximity to the intermediary datasets. As such,
computational cost incurred by SNDS users remains fixed as new datasets are
added to the server. We validate SNDS on a plethora of real-world tasks and
find that data recommended by SNDS improves downstream task performance over
baselines. We also demonstrate the scalability of SNDS by showing its ability
to select relevant data for transfer outside of the natural image setting.
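To make the mechanism above concrete: experts are trained once on a fixed set of intermediary datasets, each indexed source is represented offline by how well each expert transfers to it, and a downstream task is represented the same way at query time, so the user-side cost scales with the number of intermediaries rather than the number of sources. The sketch below is a minimal illustration under assumptions of ours: the expert.score interface, the dot-product similarity, and all names are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def represent(dataset, experts):
    """Represent a dataset by its proximity to the intermediaries:
    a K-dim vector of per-expert transferability scores.
    (`expert.score` is a hypothetical interface, e.g. proxy-task accuracy.)"""
    return np.array([expert.score(dataset) for expert in experts])

def recommend(task, experts, source_reprs, top_k=5):
    """Client-side query: K expert evaluations on the task, then one
    similarity score per indexed source. `source_reprs` is the (N, K)
    matrix of source representations precomputed once on the server,
    so adding a source adds a row without changing the client's cost."""
    task_repr = represent(task, experts)   # cost: K evaluations, fixed
    sims = source_reprs @ task_repr        # assumed dot-product similarity
    return np.argsort(sims)[::-1][:top_k]  # indices of most relevant sources
```

Contrast this with NDS, where one expert is trained per indexed source and the client must evaluate all of them, so per-user query cost grows linearly as datasets are added.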
Related papers
- How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z) - A Novel Neural Network-Based Federated Learning System for Imbalanced
and Non-IID Data [2.9642661320713555]
Most machine learning algorithms rely heavily on large amounts of data, which may be collected from various sources.
To combat this issue, researchers have introduced federated learning, where a prediction model is learnt while ensuring the privacy of clients' data.
In this research, we propose a centralized, neural network-based federated learning system.
arXiv Detail & Related papers (2023-11-16T17:14:07Z) - Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset curation: filtering a large uncurated pool of image-text data (a minimal sketch of this step appears after this list).
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z) - Outsourcing Training without Uploading Data via Efficient Collaborative
Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server.
We propose instead to leverage widely available open-source data: massive datasets collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
arXiv Detail & Related papers (2022-10-23T00:12:18Z) - Collaborative Self Organizing Map with DeepNNs for Fake Task Prevention
in Mobile Crowdsensing [26.6224977032229]
Mobile Crowdsensing (MCS) is a sensing paradigm that has transformed the way that various service providers collect, process, and analyze data.
Various threats, such as data poisoning, clogging task attacks, and fake sensing tasks, adversely affect the performance of MCS systems.
In this work, a Self Organizing Feature Map (SOFM), an artificial neural network trained in an unsupervised manner, is used to pre-cluster the legitimate data in the dataset.
arXiv Detail & Related papers (2022-02-17T04:56:28Z) - IADA: Iterative Adversarial Data Augmentation Using Formal Verification
and Expert Guidance [1.599072005190786]
We propose an iterative adversarial data augmentation framework to learn neural network models.
The proposed framework is applied to an artificial 2D dataset, the MNIST dataset, and a human motion dataset.
We show that our training method can improve the robustness and accuracy of the learned model.
arXiv Detail & Related papers (2021-08-16T03:05:53Z) - Neural Data Server: A Large-Scale Search Engine for Transfer Learning
Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data for the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a dataset from a related domain, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)