SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of
Autoencoders
- URL: http://arxiv.org/abs/2001.04893v1
- Date: Tue, 14 Jan 2020 16:52:50 GMT
- Title: SimEx: Express Prediction of Inter-dataset Similarity by a Fleet of
Autoencoders
- Authors: Inseok Hwang, Jinho Lee, Frank Liu, Minsik Cho
- Abstract summary: Knowing the similarity between sets of data has a number of positive implications in training an effective model.
We present SimEx, a new method for early prediction of inter-dataset similarity using a set of pretrained autoencoders.
Our method achieves more than 10x speed-up in predicting inter-dataset similarity compared to common similarity-estimating practices.
- Score: 13.55607978839719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowing the similarity between sets of data has a number of positive implications for training an effective model, such as assisting an informed selection, among known datasets, of those favorable for model transfer or data augmentation when facing an unknown dataset. Common practices for estimating the similarity between data include comparing in the original sample space, comparing in the embedding space of a model performing a certain task, or fine-tuning a pretrained model on different datasets and evaluating the resulting performance changes. However, these practices suffer from shallow comparisons, task-specific biases, or the extensive time and computation required to perform the comparisons. We present SimEx, a new method for early prediction of inter-dataset similarity using a set of pretrained autoencoders, each of which is dedicated to reconstructing a specific part of known data. Specifically, our method feeds unknown data samples to those pretrained autoencoders and evaluates the difference between the reconstructed output samples and their original inputs. Our intuition is that the more similar the unknown data samples are to the part of known data an autoencoder was trained on, the better the chances that this autoencoder can apply its trained knowledge and reconstruct outputs closer to the originals. We demonstrate that our method achieves more than a 10x speed-up in predicting inter-dataset similarity compared to common similarity-estimating practices. We also demonstrate that the inter-dataset similarity estimated by our method correlates well with common practices and outperforms the baseline approaches of comparing in the sample or embedding spaces, without training anything new at comparison time.
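A minimal sketch of this procedure, assuming a small fully-connected autoencoder, mean-squared reconstruction error as the per-autoencoder score, and flat feature vectors as inputs; none of these details are fixed by the abstract:

```python
# Illustrative sketch of the SimEx idea: one autoencoder per known data part,
# and the reconstruction error of unknown samples as a similarity signal.
# The MLP architecture, MSE score, and flat feature vectors are assumptions
# made here for brevity; the abstract does not specify these details.
import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))


def pretrain_fleet(known_parts, dim, epochs=100, lr=1e-3):
    """Train one autoencoder per known data part (done once, ahead of time)."""
    fleet = []
    for part in known_parts:                     # part: tensor (n_samples, dim)
        ae = Autoencoder(dim)
        opt = torch.optim.Adam(ae.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(ae(part), part)
            loss.backward()
            opt.step()
        fleet.append(ae)
    return fleet


@torch.no_grad()
def reconstruction_errors(fleet, unknown):
    """Forward passes only: lower error suggests higher similarity to that part."""
    return [nn.functional.mse_loss(ae(unknown), unknown).item() for ae in fleet]


# Synthetic example: the unknown batch resembles the first (zero-centered) part,
# so its error against the first autoencoder should typically be lower.
parts = [torch.randn(256, 32), torch.randn(256, 32) + 3.0]
fleet = pretrain_fleet(parts, dim=32)
print(reconstruction_errors(fleet, torch.randn(64, 32)))
```

At comparison time only forward passes through the pretrained fleet are needed, which is where the claimed speed-up over training-based similarity estimates comes from.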
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Data Similarity is Not Enough to Explain Language Model Performance [6.364065652816667]
Similarity measures are often assumed to correlate with language model performance.
We find that similarity metrics are not correlated with accuracy or even with each other.
This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
arXiv Detail & Related papers (2023-11-15T14:48:08Z)
- Sample and Predict Your Latent: Modality-free Sequential Disentanglement via Contrastive Estimation [2.7759072740347017]
We introduce a self-supervised sequential disentanglement framework based on contrastive estimation with no external signals.
In practice, we propose a unified, efficient, and easy-to-code sampling strategy for semantically similar and dissimilar views of the data.
Our method presents state-of-the-art results in comparison to existing techniques.
arXiv Detail & Related papers (2023-05-25T10:50:30Z)
- Multi-Task Self-Supervised Time-Series Representation Learning [3.31490164885582]
Time-series representation learning can extract representations from data with temporal dynamics and sparse labels.
We propose a new time-series representation learning method by combining the advantages of self-supervised tasks.
We evaluate the proposed framework on three downstream tasks: time-series classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2023-03-02T07:44:06Z)
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Approximate Bayesian Computation with Path Signatures [0.5156484100374059]
We introduce the use of path signatures as a natural candidate feature set for constructing distances between time series data; a minimal signature-distance sketch appears after this list.
Our experiments show that such an approach can generate more accurate approximate Bayesian posteriors than existing techniques for time series models.
arXiv Detail & Related papers (2021-06-23T17:25:43Z)
- Learning from Incomplete Features by Simultaneous Training of Neural Networks and Sparse Coding [24.3769047873156]
This paper addresses the problem of training a classifier on a dataset with incomplete features.
We assume that different subsets of features (random or structured) are available at each data instance.
A new supervised learning method is developed to train a general classifier, using only a subset of features per sample.
arXiv Detail & Related papers (2020-11-28T02:20:39Z)
- Few-shot Visual Reasoning with Meta-analogical Contrastive Learning [141.2562447971]
We propose to solve a few-shot (or low-shot) visual reasoning problem, by resorting to analogical reasoning.
We extract structural relationships between elements in both domains, and enforce them to be as similar as possible with analogical learning.
We validate our method on the RAVEN dataset, on which it outperforms state-of-the-art methods, with larger gains when the training data is scarce.
arXiv Detail & Related papers (2020-07-23T14:00:34Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the dataset constructed this way can significantly improve the ability of the learned FER model.
To reduce the cost of training on this enlarged dataset, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
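As referenced in the path-signature entry above, the following is a minimal sketch of a signature-based distance between time series, assuming a depth-2 truncated signature and a Euclidean distance between signature vectors; both choices are illustrative and not taken from that paper.

```python
# Minimal sketch: depth-2 truncated path signatures as time-series features,
# compared with a Euclidean distance. The truncation depth and the distance
# are illustrative assumptions, not the paper's exact construction.
import numpy as np


def signature_depth2(path: np.ndarray) -> np.ndarray:
    """path: array of shape (length, channels); returns level-1 and level-2 terms."""
    increments = np.diff(path, axis=0)                   # (length-1, channels)
    level1 = increments.sum(axis=0)                      # total displacement per channel
    # Level-2 iterated integrals of a piecewise-linear path:
    # S[i, j] = sum_{k<l} dX_k[i]*dX_l[j] + 0.5 * sum_k dX_k[i]*dX_k[j]
    prefix = np.cumsum(increments, axis=0) - increments  # increments strictly before step l
    level2 = prefix.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])


def signature_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between the truncated signatures of two series."""
    return float(np.linalg.norm(signature_depth2(x) - signature_depth2(y)))


# Example with synthetic data: two 3-channel random walks of length 100.
rng = np.random.default_rng(0)
a = np.cumsum(rng.normal(size=(100, 3)), axis=0)
b = np.cumsum(rng.normal(size=(100, 3)), axis=0)
print(signature_distance(a, b))
```

Such a distance can serve as a discrepancy measure between observed and simulated series, for example inside an approximate Bayesian computation loop.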
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.