Geometric Dataset Distances via Optimal Transport
- URL: http://arxiv.org/abs/2002.02923v1
- Date: Fri, 7 Feb 2020 17:51:26 GMT
- Title: Geometric Dataset Distances via Optimal Transport
- Authors: David Alvarez-Melis and Nicolò Fusi
- Abstract summary: We propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing.
This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties.
Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
- Score: 15.153110906331733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The notion of task similarity is at the core of various machine learning
paradigms, such as domain adaptation and meta-learning. Current methods to
quantify it are often heuristic, make strong assumptions on the label sets
across the tasks, and many are architecture-dependent, relying on task-specific
optimal parameters (e.g., require training a model on each dataset). In this
work we propose an alternative notion of distance between datasets that (i) is
model-agnostic, (ii) does not involve training, (iii) can compare datasets even
if their label sets are completely disjoint and (iv) has solid theoretical
footing. This distance relies on optimal transport, which provides it with rich
geometry awareness, interpretable correspondences and well-understood
properties. Our results show that this novel distance provides meaningful
comparison of datasets, and correlates well with transfer learning hardness
across various experimental settings and datasets.
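The core computation behind such a distance can be sketched in a few lines. The snippet below is a minimal illustration under simplifying assumptions (equal-size datasets, uniform weights, plain squared-Euclidean feature cost), not the authors' OTDD implementation, which additionally folds a Wasserstein distance between per-label distributions into the ground cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_dataset_distance(X_a, X_b):
    """Exact optimal transport cost between two equal-size point clouds.

    With uniform weights and equal sample counts, OT reduces to an
    assignment problem. This uses a plain feature-space cost only;
    the paper's OTDD also accounts for label distributions.
    """
    assert len(X_a) == len(X_b), "equal sizes let OT reduce to assignment"
    # Pairwise squared Euclidean costs, shape (n, n)
    M = ((X_a[:, None, :] - X_b[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(M)  # optimal one-to-one matching
    return M[rows, cols].mean()            # average transport cost

rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, size=(50, 2))  # samples standing in for dataset A
X_b = rng.normal(3.0, 1.0, size=(50, 2))  # a shifted stand-in for dataset B
d_ab = ot_dataset_distance(X_a, X_b)      # large: the distributions differ
d_aa = ot_dataset_distance(X_a, X_a)      # zero: a dataset vs. itself
```

Because the OT plan is an explicit matching between samples, the resulting distance is interpretable: one can inspect which points of one dataset are paired with which points of the other.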
Related papers
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Generating Synthetic Datasets by Interpolating along Generalized Geodesics [18.278734644369052]
We show how new datasets can be synthesised as "combinations" of existing ones.
In particular, we show how to interpolate even between datasets with distinct and unrelated label sets.
We demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
arXiv Detail & Related papers (2023-06-12T04:46:44Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Wasserstein Task Embedding for Measuring Task Similarities [14.095478018850374]
Measuring similarities between different tasks is critical in a broad spectrum of machine learning problems.
We leverage optimal transport theory and define a novel task embedding for supervised classification.
We show that the proposed embedding leads to a significantly faster comparison of tasks compared to related approaches.
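The speedup from an embedding comes from replacing pairwise OT solves with plain vector arithmetic. The toy sketch below is not the cited paper's construction: it only illustrates the idea for one-dimensional feature distributions, where k evenly spaced quantiles form an embedding whose scaled Euclidean distance approximates the 2-Wasserstein distance:

```python
import numpy as np

def quantile_embedding(x, k=100):
    """Embed a 1-D sample as k evenly spaced empirical quantiles.

    For 1-D distributions, the scaled Euclidean distance between such
    embeddings approximates the 2-Wasserstein distance, so many tasks
    can be compared pairwise without solving any OT problem.
    """
    qs = np.linspace(0.0, 1.0, k)
    return np.quantile(x, qs)

rng = np.random.default_rng(1)
task_a = rng.normal(0.0, 1.0, 2000)   # scalar features of "task A"
task_b = rng.normal(0.5, 1.0, 2000)   # mean-shifted features of "task B"

e_a, e_b = quantile_embedding(task_a), quantile_embedding(task_b)
# Approximates the true W2 between the distributions (0.5 here,
# the mean shift between the two Gaussians), up to sampling noise.
w2_approx = np.linalg.norm(e_a - e_b) / np.sqrt(len(e_a))
```

Once each task is embedded, comparing N tasks costs N embedding computations plus cheap vector distances, rather than N(N-1)/2 OT solves.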
arXiv Detail & Related papers (2022-08-24T18:11:04Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
We propose a new design, Detection Hub, that is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a state similar to those trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
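The matching objective can be made concrete with a deliberately tiny example. The sketch below is a hypothetical one-dimensional analogue, not the paper's method (which uses deep networks and full trajectories): a single synthetic point is optimized, with hand-derived gradients for a scalar linear model, so that one inner training step from the initialization lands near the parameters obtained by training on the real data:

```python
import numpy as np

# Real task: y = 2x, scalar model f(w, x) = w * x, full-batch GD.
x_real = np.array([1.0, 2.0, 3.0])
y_real = 2.0 * x_real
w0, lr = 1.0, 0.05

# "Expert" parameters: train on the real data for a few steps.
w_real = w0
for _ in range(5):
    w_real -= lr * np.mean((w_real * x_real - y_real) * x_real)

# Distilled data: one point (xs, ys), optimized so that ONE inner
# step from w0 reproduces w_real (parameter-space matching).
xs, ys = 1.0, 0.0
for _ in range(5000):
    w_syn = w0 - lr * (w0 * xs - ys) * xs   # inner training step
    err = w_syn - w_real                    # distance between parameters
    # Hand-derived gradients of err**2 w.r.t. the distilled point
    g_xs = 2 * err * (-lr) * (2 * w0 * xs - ys)
    g_ys = 2 * err * lr * xs
    xs -= 0.2 * g_xs
    ys -= 0.2 * g_ys

# One step on the optimized distilled point now lands close to w_real.
w_syn = w0 - lr * (w0 * xs - ys) * xs
```

The outer loop differentiates through the inner training step; in the paper this backpropagation runs through several steps of network training rather than one closed-form scalar update.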
arXiv Detail & Related papers (2022-03-22T17:58:59Z)
- Mixing Deep Learning and Multiple Criteria Optimization: An Application to Distributed Learning with Multiple Datasets [0.0]
The training phase is the most important stage of the machine learning process.
We develop a multiple criteria optimization model in which each criterion measures the distance between the output associated with a specific input and its label.
We propose a scalarization approach to implement this model and numerical experiments in digit classification using MNIST data.
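Scalarization turns the multiple criteria into a single objective via a weighted sum. The sketch below is a minimal, hypothetical version with a linear model and two synthetic datasets, one mean-squared-error criterion per dataset (the cited work applies this to MNIST digit classification):

```python
import numpy as np

def scalarized_gd(datasets, weights, lr=0.05, steps=500):
    """Minimize sum_k weights[k] * MSE_k, one criterion per dataset."""
    d = datasets[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for wk, (X, y) in zip(weights, datasets):
            # Gradient of this dataset's MSE, scaled by its weight
            grad += wk * 2.0 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

rng = np.random.default_rng(3)
# Two "distributed" datasets drawn from the same linear ground truth
w_true = np.array([1.0, -2.0])
data = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    data.append((X, X @ w_true + 0.01 * rng.normal(size=100)))

w_hat = scalarized_gd(data, weights=[0.5, 0.5])  # recovers w_true
```

Varying the weight vector traces out different Pareto-efficient trade-offs between the per-dataset criteria; equal weights treat all datasets as equally important.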
arXiv Detail & Related papers (2021-12-02T16:00:44Z)
- A contribution to Optimal Transport on incomparable spaces [4.873362301533825]
This thesis studies the complex scenario in which different data belong to incomparable spaces, and proposes a set of Optimal Transport tools for these cases.
arXiv Detail & Related papers (2020-11-09T14:13:52Z)
- An Information-Geometric Distance on the Space of Tasks [31.359578768463752]
This paper prescribes a distance between learning tasks modeled as joint distributions on data and labels.
We develop an algorithm to compute the distance which iteratively transports the marginal on the data of the source task to that of the target task.
We perform thorough empirical validation and analysis across diverse image classification datasets to show that the coupled transfer distance correlates strongly with the difficulty of fine-tuning.
arXiv Detail & Related papers (2020-11-01T19:48:39Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
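A rough sketch of this kind of OT-based aggregation, under simplifying assumptions (a fixed rather than trainable reference, uniform weights, hand-rolled Sinkhorn iterations), could look as follows:

```python
import numpy as np

def sinkhorn_plan(M, reg=0.5, n_iters=100):
    """Entropy-regularized OT plan between two uniform measures."""
    n, m = M.shape
    K = np.exp(-M / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):          # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_pool(X, R):
    """Aggregate a variable-size set X into len(R) output vectors by
    transporting its elements onto a reference R (fixed here; trainable
    in the cited paper)."""
    M = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    P = sinkhorn_plan(M)              # (n, m) transport plan
    # Each reference slot pools the inputs the plan assigns to it
    return (P / P.sum(axis=0, keepdims=True)).T @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(17, 4))          # a set of 17 four-dim elements
R = rng.normal(size=(3, 4))           # reference of size 3
Z = ot_pool(X, R)                     # fixed-size (3, 4) representation
```

Each output row is a convex combination of input elements, with mixing weights given by the transport plan, which is what makes the mechanism resemble attention with an OT-derived weighting.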
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content above) and is not responsible for any consequences of its use.