Geometric Dataset Distances via Optimal Transport
- URL: http://arxiv.org/abs/2002.02923v1
- Date: Fri, 7 Feb 2020 17:51:26 GMT
- Title: Geometric Dataset Distances via Optimal Transport
- Authors: David Alvarez-Melis and Nicolò Fusi
- Abstract summary: We propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing.
This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties.
Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
- Score: 15.153110906331733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The notion of task similarity is at the core of various machine learning
paradigms, such as domain adaptation and meta-learning. Current methods to
quantify it are often heuristic, make strong assumptions on the label sets
across the tasks, and many are architecture-dependent, relying on task-specific
optimal parameters (e.g., require training a model on each dataset). In this
work we propose an alternative notion of distance between datasets that (i) is
model-agnostic, (ii) does not involve training, (iii) can compare datasets even
if their label sets are completely disjoint and (iv) has solid theoretical
footing. This distance relies on optimal transport, which provides it with rich
geometry awareness, interpretable correspondences and well-understood
properties. Our results show that this novel distance provides meaningful
comparison of datasets, and correlates well with transfer learning hardness
across various experimental settings and datasets.
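The core computation behind such a distance can be sketched in a few lines. The snippet below is a minimal illustration under simplifying assumptions (equal-size datasets, uniform weights, plain squared-Euclidean feature cost), not the authors' OTDD implementation, which additionally folds a Wasserstein distance between per-label distributions into the ground cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_dataset_distance(X_a, X_b):
    """Exact optimal transport cost between two equal-size point clouds.

    With uniform weights and equal sample counts, OT reduces to an
    assignment problem. This uses a plain feature-space cost only;
    the paper's OTDD also accounts for label distributions.
    """
    assert len(X_a) == len(X_b), "equal sizes let OT reduce to assignment"
    # Pairwise squared Euclidean costs, shape (n, n)
    M = ((X_a[:, None, :] - X_b[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(M)  # optimal one-to-one matching
    return M[rows, cols].mean()            # average transport cost

rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, size=(50, 2))  # samples standing in for dataset A
X_b = rng.normal(3.0, 1.0, size=(50, 2))  # a shifted stand-in for dataset B
d_ab = ot_dataset_distance(X_a, X_b)      # large: the distributions differ
d_aa = ot_dataset_distance(X_a, X_a)      # zero: a dataset vs. itself
```

Because the OT plan is an explicit matching between samples, the resulting distance is interpretable: one can inspect which points of one dataset are paired with which points of the other.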
Related papers
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Generating Synthetic Datasets by Interpolating along Generalized Geodesics [18.278734644369052]
We show how new datasets can be synthesised as "combinations" of existing ones.
In particular, we show how to interpolate even between datasets with distinct and unrelated label sets.
We demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
arXiv Detail & Related papers (2023-06-12T04:46:44Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Wasserstein Task Embedding for Measuring Task Similarities [14.095478018850374]
Measuring similarities between different tasks is critical in a broad spectrum of machine learning problems.
We leverage optimal transport theory and define a novel task embedding for supervised classification.
We show that the proposed embedding leads to a significantly faster comparison of tasks compared to related approaches.
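The speedup from an embedding comes from replacing pairwise OT solves with plain vector arithmetic. The toy sketch below is not the cited paper's construction: it only illustrates the idea for one-dimensional feature distributions, where k evenly spaced quantiles form an embedding whose scaled Euclidean distance approximates the 2-Wasserstein distance:

```python
import numpy as np

def quantile_embedding(x, k=100):
    """Embed a 1-D sample as k evenly spaced empirical quantiles.

    For 1-D distributions, the scaled Euclidean distance between such
    embeddings approximates the 2-Wasserstein distance, so many tasks
    can be compared pairwise without solving any OT problem.
    """
    qs = np.linspace(0.0, 1.0, k)
    return np.quantile(x, qs)

rng = np.random.default_rng(1)
task_a = rng.normal(0.0, 1.0, 2000)   # scalar features of "task A"
task_b = rng.normal(0.5, 1.0, 2000)   # mean-shifted features of "task B"

e_a, e_b = quantile_embedding(task_a), quantile_embedding(task_b)
# Approximates the true W2 between the distributions (0.5 here,
# the mean shift between the two Gaussians), up to sampling noise.
w2_approx = np.linalg.norm(e_a - e_b) / np.sqrt(len(e_a))
```

Once each task is embedded, comparing N tasks costs N embedding computations plus cheap vector distances, rather than N(N-1)/2 OT solves.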
arXiv Detail & Related papers (2022-08-24T18:11:04Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
We propose a new design, Detection Hub, that is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a state similar to those trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
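The matching objective can be made concrete with a deliberately tiny example. The sketch below is a hypothetical one-dimensional analogue, not the paper's method (which uses deep networks and full trajectories): a single synthetic point is optimized, with hand-derived gradients for a scalar linear model, so that one inner training step from the initialization lands near the parameters obtained by training on the real data:

```python
import numpy as np

# Real task: y = 2x, scalar model f(w, x) = w * x, full-batch GD.
x_real = np.array([1.0, 2.0, 3.0])
y_real = 2.0 * x_real
w0, lr = 1.0, 0.05

# "Expert" parameters: train on the real data for a few steps.
w_real = w0
for _ in range(5):
    w_real -= lr * np.mean((w_real * x_real - y_real) * x_real)

# Distilled data: one point (xs, ys), optimized so that ONE inner
# step from w0 reproduces w_real (parameter-space matching).
xs, ys = 1.0, 0.0
for _ in range(5000):
    w_syn = w0 - lr * (w0 * xs - ys) * xs   # inner training step
    err = w_syn - w_real                    # distance between parameters
    # Hand-derived gradients of err**2 w.r.t. the distilled point
    g_xs = 2 * err * (-lr) * (2 * w0 * xs - ys)
    g_ys = 2 * err * lr * xs
    xs -= 0.2 * g_xs
    ys -= 0.2 * g_ys

# One step on the optimized distilled point now lands close to w_real.
w_syn = w0 - lr * (w0 * xs - ys) * xs
```

The outer loop differentiates through the inner training step; in the paper this backpropagation runs through several steps of network training rather than one closed-form scalar update.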
arXiv Detail & Related papers (2022-03-22T17:58:59Z)
- Mixing Deep Learning and Multiple Criteria Optimization: An Application to Distributed Learning with Multiple Datasets [0.0]
The training phase is the most important stage of the machine learning process.
We develop a multiple criteria optimization model in which each criterion measures the distance between the output associated with a specific input and its label.
We propose a scalarization approach to implement this model and numerical experiments in digit classification using MNIST data.
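Scalarization turns the multiple criteria into a single objective via a weighted sum. The sketch below is a minimal, hypothetical version with a linear model and two synthetic datasets, one mean-squared-error criterion per dataset (the cited work applies this to MNIST digit classification):

```python
import numpy as np

def scalarized_gd(datasets, weights, lr=0.05, steps=500):
    """Minimize sum_k weights[k] * MSE_k, one criterion per dataset."""
    d = datasets[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for wk, (X, y) in zip(weights, datasets):
            # Gradient of this dataset's MSE, scaled by its weight
            grad += wk * 2.0 * X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

rng = np.random.default_rng(3)
# Two "distributed" datasets drawn from the same linear ground truth
w_true = np.array([1.0, -2.0])
data = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    data.append((X, X @ w_true + 0.01 * rng.normal(size=100)))

w_hat = scalarized_gd(data, weights=[0.5, 0.5])  # recovers w_true
```

Varying the weight vector traces out different Pareto-efficient trade-offs between the per-dataset criteria; equal weights treat all datasets as equally important.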
arXiv Detail & Related papers (2021-12-02T16:00:44Z)
- A contribution to Optimal Transport on incomparable spaces [4.873362301533825]
This thesis studies the complex scenario in which different data belong to incomparable spaces, and proposes a set of Optimal Transport tools for these cases.
arXiv Detail & Related papers (2020-11-09T14:13:52Z)
- An Information-Geometric Distance on the Space of Tasks [31.359578768463752]
This paper prescribes a distance between learning tasks modeled as joint distributions on data and labels.
We develop an algorithm to compute the distance which iteratively transports the marginal on the data of the source task to that of the target task.
We perform thorough empirical validation and analysis across diverse image classification datasets to show that the coupled transfer distance correlates strongly with the difficulty of fine-tuning.
arXiv Detail & Related papers (2020-11-01T19:48:39Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
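A rough sketch of this kind of OT-based aggregation, under simplifying assumptions (a fixed rather than trainable reference, uniform weights, hand-rolled Sinkhorn iterations), could look as follows:

```python
import numpy as np

def sinkhorn_plan(M, reg=0.5, n_iters=100):
    """Entropy-regularized OT plan between two uniform measures."""
    n, m = M.shape
    K = np.exp(-M / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):          # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_pool(X, R):
    """Aggregate a variable-size set X into len(R) output vectors by
    transporting its elements onto a reference R (fixed here; trainable
    in the cited paper)."""
    M = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    P = sinkhorn_plan(M)              # (n, m) transport plan
    # Each reference slot pools the inputs the plan assigns to it
    return (P / P.sum(axis=0, keepdims=True)).T @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(17, 4))          # a set of 17 four-dim elements
R = rng.normal(size=(3, 4))           # reference of size 3
Z = ot_pool(X, R)                     # fixed-size (3, 4) representation
```

Each output row is a convex combination of input elements, with mixing weights given by the transport plan, which is what makes the mechanism resemble attention with an OT-derived weighting.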
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content above) and is not responsible for any consequences of its use.