Generating Synthetic Datasets by Interpolating along Generalized
Geodesics
- URL: http://arxiv.org/abs/2306.06866v1
- Date: Mon, 12 Jun 2023 04:46:44 GMT
- Title: Generating Synthetic Datasets by Interpolating along Generalized
Geodesics
- Authors: Jiaojiao Fan and David Alvarez-Melis
- Abstract summary: We explore pretraining on datasets that can be synthesized as "combinations" of those already present in a pretraining collection.
In particular, we show how to interpolate even between datasets with distinct and unrelated label sets.
We demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
- Score: 18.278734644369052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data for pretraining machine learning models often consists of collections of
heterogeneous datasets. Although training on their union is reasonable in
agnostic settings, it might be suboptimal when the target domain -- where the
model will ultimately be used -- is known in advance. In that case, one would
ideally pretrain only on the dataset(s) most similar to the target one. Instead
of limiting this choice to those datasets already present in the pretraining
collection, here we explore extending this search to all datasets that can be
synthesized as "combinations" of them. We define such combinations as
multi-dataset interpolations, formalized through the notion of generalized
geodesics from optimal transport (OT) theory. We compute these geodesics using
a recent notion of distance between labeled datasets, and derive alternative
interpolation schemes based on it: using either barycentric projections or
optimal transport maps, the latter computed using recent neural OT methods.
These methods are scalable, efficient, and -- notably -- can be used to
interpolate even between datasets with distinct and unrelated label sets.
Through various experiments in transfer learning in computer vision, we
demonstrate this is a promising new approach for targeted on-demand dataset
synthesis.
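The interpolation idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not from the paper: it handles features only and ignores labels, whereas the paper relies on a distance between labeled datasets; it assumes all point clouds have the same size and uniform weights, so that the exact OT plan is a permutation and the barycentric projection reduces to a reordering of the target points; and the function names `ot_map_uniform` and `generalized_geodesic` are hypothetical. It uses SciPy's assignment solver in place of the neural OT methods mentioned above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def ot_map_uniform(base, target):
    """Map each base point to its OT partner in `target`.

    Assumes both clouds have n points with uniform weights, so the optimal
    plan for the squared Euclidean cost is a one-to-one assignment.
    """
    cost = cdist(base, target, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    mapped = np.empty_like(target, dtype=float)
    mapped[rows] = target[cols]  # row i holds the image of base[i]
    return mapped


def generalized_geodesic(base, targets, weights):
    """Interpolate several clouds through a shared base cloud.

    Each target is first aligned to `base` by its own OT map; the aligned
    clouds are then averaged pointwise with convex weights.
    """
    weights = np.asarray(weights, dtype=float)
    if len(weights) != len(targets):
        raise ValueError("need one weight per target cloud")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    mapped = [ot_map_uniform(base, t) for t in targets]
    return sum(w * m for w, m in zip(weights, mapped))
```

Putting all the weight on one target returns that target reordered to match the base cloud; intermediate weights yield a synthetic cloud lying between the aligned targets. For clouds of unequal size or non-uniform weights, a general OT solver and a true barycentric projection of the plan would be needed instead of the assignment shortcut used here.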
Related papers
- Automating Data Science Pipelines with Tensor Completion [4.956678070210018]
We model data science pipelines as instances of tensor completion.
The goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values.
We extensively evaluate existing and proposed methods in a number of datasets.
arXiv Detail & Related papers (2024-10-08T22:34:08Z)
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
- Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets [11.105392318582677]
We propose a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees.
Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure.
We show that in a high-dimensional regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables.
arXiv Detail & Related papers (2024-07-01T18:48:55Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Modified CycleGAN for the synthesization of samples for wheat head segmentation [0.09999629695552192]
In the absence of an annotated dataset, synthetic data can be used for model development.
We develop a realistic annotated synthetic dataset for wheat head segmentation.
The resulting model achieved a Dice score of 83.4% on an internal dataset and 83.6% on two external Global Wheat Head Detection datasets.
arXiv Detail & Related papers (2024-02-23T06:42:58Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Joint Distributional Learning via Cramer-Wold Distance [0.7614628596146602]
We introduce the Cramer-Wold distance regularization, which can be computed in closed form, to facilitate joint distributional learning for high-dimensional datasets.
We also introduce a two-step learning method to enable flexible prior modeling and improve the alignment between the aggregated posterior and the prior distribution.
arXiv Detail & Related papers (2023-10-25T05:24:23Z)
- Tackling Computational Heterogeneity in FL: A Few Theoretical Insights [68.8204255655161]
We introduce and analyse a novel aggregation framework that allows for formalizing and tackling computationally heterogeneous data.
The proposed aggregation algorithms are extensively analyzed from both a theoretical and an experimental perspective.
arXiv Detail & Related papers (2023-07-12T16:28:21Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Geometric Dataset Distances via Optimal Transport [15.153110906331733]
We propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing.
This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties.
Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
arXiv Detail & Related papers (2020-02-07T17:51:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.