Distance in Latent Space as Novelty Measure
- URL: http://arxiv.org/abs/2003.14043v1
- Date: Tue, 31 Mar 2020 09:14:56 GMT
- Title: Distance in Latent Space as Novelty Measure
- Authors: Mark Philip Philipsen and Thomas Baltzer Moeslund
- Abstract summary: We propose to intelligently select samples when constructing data sets.
The selection methodology is based on the presumption that two dissimilar samples are worth more than two similar samples in a data set.
By using a self-supervised method to construct the latent space, it is ensured that the space fits the data well and that any upfront labeling effort can be avoided.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning performs well when training data densely covers the experience
space. For complex problems this makes data collection prohibitively expensive.
We propose to intelligently select samples when constructing data sets in order
to best utilize the available labeling budget. The selection methodology is
based on the presumption that two dissimilar samples are worth more than two
similar samples in a data set. Similarity is measured based on the Euclidean
distance between samples in the latent space produced by a DNN. By using a
self-supervised method to construct the latent space, it is ensured that the
space fits the data well and that any upfront labeling effort can be avoided.
The result is a more efficient, diverse, and balanced data set, which produces
equal or superior results with fewer labeled examples.
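Concretely, the selection rule described above amounts to a farthest-point heuristic: embed the unlabeled pool with a self-supervised encoder, then repeatedly pick the sample whose Euclidean distance to the already-selected set is largest. The sketch below is only a minimal illustration of that idea; the encoder, the function name, and the seeding choice are assumptions, not the authors' released implementation.

```python
import numpy as np

def select_diverse_subset(latents: np.ndarray, budget: int) -> list[int]:
    """Greedy farthest-point selection in latent space (illustrative sketch).

    Repeatedly picks the sample whose Euclidean distance to its nearest
    already-selected sample is largest, i.e. the most novel sample.
    """
    # Seed with the sample nearest the latent mean, a representative start.
    start = int(np.argmin(np.linalg.norm(latents - latents.mean(axis=0), axis=1)))
    selected = [start]
    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = np.linalg.norm(latents - latents[start], axis=1)
    for _ in range(budget - 1):
        novel = int(np.argmax(min_dist))  # farthest from the selected set
        selected.append(novel)
        # Update nearest-selected distances with the new pick.
        min_dist = np.minimum(
            min_dist, np.linalg.norm(latents - latents[novel], axis=1)
        )
    return selected

# Usage (hypothetical encoder): embed the unlabeled pool, then spend the
# labeling budget only on the selected indices.
# z = encoder.transform(unlabeled_pool)   # e.g. a self-supervised DNN
# to_label = select_diverse_subset(z, budget=500)
```

Because `min_dist` is updated incrementally, each pick costs a single distance pass over the pool, so the greedy loop stays O(budget x n).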
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to choose a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over a dataset graph for coreset selection (an illustrative sketch follows this list).
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Differences Between Hard and Noisy-labeled Samples: An Empirical Study [7.132368785057315]
Distinguishing noisy or incorrectly labeled samples in a labeled dataset from hard/difficult samples is an important yet under-explored topic.
We introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples.
Our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
arXiv Detail & Related papers (2023-07-20T09:24:23Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Shared Manifold Learning Using a Triplet Network for Multiple Sensor Translation and Fusion with Missing Data [2.452410403088629]
We propose a Contrastive learning based MultiModal Alignment Network (CoMMANet) to align data from different sensors into a shared and discriminative manifold.
The proposed architecture uses a multimodal triplet autoencoder to cluster the latent space so that samples of the same class from each heterogeneous modality are mapped close to each other (the underlying triplet objective is sketched after this list).
arXiv Detail & Related papers (2022-10-25T20:22:09Z)
- Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme that selects optimal subsets of fixed batch size from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Data Generation in Low Sample Size Setting Using Manifold Sampling and a Geometry-Aware VAE [0.0]
We develop two non-prior-dependent generation procedures based on the geometry of the latent space.
The latter method is used to perform data augmentation in a small sample size setting and is validated across various standard and real-life data sets.
arXiv Detail & Related papers (2021-03-25T11:07:10Z)
- Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning [54.85397562961903]
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available.
We address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in unlabeled data.
Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
arXiv Detail & Related papers (2020-07-22T10:33:55Z)
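The D2 Pruning entry above mentions forward and reverse message passing over a dataset graph. The following is only a rough sketch of what such a scheme can look like; the kNN-graph construction, the score updates, and the `gamma` discount are illustrative guesses, not the published algorithm. Forward messages aggregate neighbor difficulty so that dense, hard regions score high; reverse messages discount the neighbors of each pick so later picks stay diverse.

```python
import numpy as np

def knn_graph(latents: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of each sample's k nearest neighbors by Euclidean distance."""
    dists = np.linalg.norm(latents[:, None, :] - latents[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def message_passing_coreset(latents, difficulty, budget, k=10, gamma=0.5):
    """Two-phase coreset selection on a kNN dataset graph (illustrative only)."""
    difficulty = np.asarray(difficulty, dtype=float)
    nbrs = knn_graph(latents, k)
    # Forward pass: each sample's score aggregates its neighbors' difficulty.
    scores = difficulty + difficulty[nbrs].mean(axis=1)
    selected = []
    for _ in range(budget):
        pick = int(np.argmax(scores))
        selected.append(pick)
        scores[pick] = -np.inf        # never re-select a chosen sample
        scores[nbrs[pick]] *= gamma   # reverse pass: discount its neighbors
    return selected
```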
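The CoMMANet entry above clusters a shared latent space with a multimodal triplet autoencoder. The triplet objective that typically drives such clustering is shown below; treating the anchor and positive as same-class samples from different sensors is an assumption about the setup, not a quote of the paper's architecture.

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 1.0) -> float:
    """Hinge-style triplet loss in a shared embedding space.

    Pulls the anchor toward a same-class sample (possibly from another
    modality) and pushes it at least `margin` away from a different-class
    sample.
    """
    d_pos = np.linalg.norm(anchor - positive)  # same class, e.g. other sensor
    d_neg = np.linalg.norm(anchor - negative)  # different class
    return max(0.0, d_pos - d_neg + margin)
```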
This list is automatically generated from the titles and abstracts of the papers on this site.