Combining datasets to increase the number of samples and improve model
fitting
- URL: http://arxiv.org/abs/2210.05165v2
- Date: Tue, 16 May 2023 08:47:23 GMT
- Title: Combining datasets to increase the number of samples and improve model
fitting
- Authors: Thu Nguyen, Rabindra Khadka, Nhan Phan, Anis Yazidi, Pål
Halvorsen, Michael A. Riegler
- Abstract summary: We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimensionality before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
- Score: 7.4771091238795595
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: For many use cases, combining information from different datasets can be of
interest to improve a machine learning model's performance, especially when the
number of samples from at least one of the datasets is small. However, a
potential challenge in such cases is that the features from these datasets are
not identical, even though there are some commonly shared features among the
datasets. To tackle this challenge, we propose a novel framework called Combine
datasets based on Imputation (ComImp). In addition, we propose a variant of
ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to
reduce dimensionality before combining datasets. This is useful when the datasets
have a large number of features that are not shared between them. Furthermore,
our framework can also be utilized for data preprocessing by imputing missing
data, i.e., filling in the missing entries while combining different datasets.
To illustrate the power of the proposed methods and their potential usages, we
conduct experiments on various tasks (regression and classification) and
different data types (tabular and time series data), including cases where the
datasets to be combined have missing data. We also investigate how the devised
methods can be used with transfer learning to further improve model training.
Our results indicate that the proposed methods are somewhat similar to transfer
learning in that the merge can significantly improve the accuracy of a
prediction model on smaller datasets. In addition, the methods can boost
performance by a significant margin when combining small datasets and can
provide extra improvement when used with transfer learning.
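A natural reading of the abstract's description of ComImp is: stack the datasets on the union of their feature columns, treat the features a given dataset lacks as missing values, and impute them. The sketch below illustrates that reading with pandas and scikit-learn; the column names, toy data, and choice of imputer are assumptions for illustration only, not the authors' actual implementation.

```python
# Illustrative sketch of the ComImp idea (an assumption based on the abstract,
# not the paper's released code): concatenate two datasets on the union of
# their feature columns, then impute the entries that each dataset lacks.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two hypothetical datasets that share only the "age" and "bmi" features.
df_a = pd.DataFrame({"age": [34, 51, 29],
                     "bmi": [22.1, 27.8, 24.3],
                     "glucose": [90, 110, 85]})
df_b = pd.DataFrame({"age": [45, 60],
                     "bmi": [26.0, 30.2],
                     "blood_pressure": [120, 135]})

# Stack vertically on the union of columns; unshared features become NaN.
combined = pd.concat([df_a, df_b], ignore_index=True, sort=False)

# Fill the resulting missing entries; any standard imputer (mean, kNN,
# iterative, ...) could be substituted here.
imputer = IterativeImputer(random_state=0)
combined_imputed = pd.DataFrame(imputer.fit_transform(combined),
                                columns=combined.columns)
print(combined_imputed)
```

Under the same reading, the PCA-ComImp variant would first project each dataset's unshared features onto a small number of principal components before the concatenation step, so that fewer entries need to be imputed when the non-shared feature sets are large.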
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z) - UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed data samples available for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Revisiting Permutation Symmetry for Merging Models between Different
Datasets [3.234560001579257]
We investigate the properties of merging models between different datasets.
We find that the accuracy of the merged model decreases more significantly as the datasets diverge more.
We show that condensed datasets created by dataset condensation can be used as substitutes for the original datasets.
arXiv Detail & Related papers (2023-06-09T03:00:34Z) - Neural Network Architecture for Database Augmentation Using Shared
Features [0.0]
Inherent challenges in some domains, such as medicine, make it difficult to create large single-source datasets or multi-source datasets with identical features.
We propose a neural network architecture that can provide data augmentation using features common between these datasets.
arXiv Detail & Related papers (2023-02-02T19:17:06Z) - A Case for Dataset Specific Profiling [0.9023847175654603]
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
arXiv Detail & Related papers (2022-08-01T18:38:05Z) - Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance.
arXiv Detail & Related papers (2021-09-28T17:08:22Z) - DAIL: Dataset-Aware and Invariant Learning for Face Recognition [67.4903809903022]
To achieve good performance in face recognition, a large scale training dataset is usually required.
It is problematic and troublesome to naively combine different datasets due to two major issues.
First, naively treating the same person as different classes in different datasets during training will affect back-propagation.
Second, manually cleaning labels may take formidable human effort, especially when there are millions of images and thousands of identities.
arXiv Detail & Related papers (2021-01-14T01:59:52Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.