Related papers: Generating Synthetic Datasets by Interpolating along Generalized Geodesics

URL: http://arxiv.org/abs/2306.06866v1
Date: Mon, 12 Jun 2023 04:46:44 GMT
Title: Generating Synthetic Datasets by Interpolating along Generalized Geodesics
Authors: Jiaojiao Fan and David Alvarez-Melis
Abstract summary: We explore pretraining on datasets that can be synthesized as "combinations" of existing ones. In particular, we show how to interpolate even between datasets with distinct and unrelated label sets. We demonstrate that this is a promising new approach for targeted on-demand dataset synthesis.
Score: 18.278734644369052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain -- where the model will ultimately be used -- is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and -- notably -- can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
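
Illustrative sketch (not the authors' implementation): the barycentric-projection scheme mentioned in the abstract can be illustrated on toy point clouds. The code below makes several simplifying assumptions that are not from the paper: the cost is the plain squared Euclidean distance on features rather than the labeled-dataset distance, labels are ignored entirely, the transport plan is an entropic (Sinkhorn) approximation, and all function names are hypothetical. It maps a base point cloud onto each target cloud by barycentric projection, then mixes the mapped points with convex weights, which is the standard form of a generalized geodesic with respect to a base measure.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sinkhorn_plan(xs, ys, eps=0.1, iters=500):
    """Entropic OT plan between uniform measures on xs and ys (Sinkhorn iterations)."""
    n, m = len(xs), len(ys)
    kernel = [[math.exp(-sq_dist(x, y) / eps) for y in ys] for x in xs]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_projection(plan, ys):
    """Send each source point to the plan-weighted average of the target points."""
    dim = len(ys[0])
    projected = []
    for row in plan:
        mass = sum(row)
        projected.append([sum(p * y[d] for p, y in zip(row, ys)) / mass
                          for d in range(dim)])
    return projected

def generalized_geodesic(base, targets, weights, eps=0.1):
    """Convex combination of the projected maps from `base` to each target cloud."""
    maps = [barycentric_projection(sinkhorn_plan(base, t, eps), t) for t in targets]
    dim = len(base[0])
    return [[sum(w * mapped[i][d] for w, mapped in zip(weights, maps))
             for d in range(dim)]
            for i in range(len(base))]

# Toy data: one base cloud and two target clouds placed above and below it.
base = [[0.0, 0.0], [1.0, 0.0]]
target_a = [[0.0, 2.0], [1.0, 2.0]]
target_b = [[0.0, -2.0], [1.0, -2.0]]

# Equal weights place the synthetic points midway between the two targets.
synth = generalized_geodesic(base, [target_a, target_b], [0.5, 0.5])
```

Changing the weights moves the synthetic cloud toward one target or the other; weights of 1 and 0 recover the projection onto a single target. A smaller `eps` gives a sharper plan but can underflow in this naive implementation.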

Related papers

Heterogeneous Self-Supervised Acoustic Pre-Training with Local Constraints [64.15709757611369]
We propose a new self-supervised pre-training approach to dealing with heterogeneous data. The proposed approach can significantly improve the adaptivity of the self-supervised pre-trained model for the downstream supervised fine-tuning tasks.
arXiv Detail & Related papers (2025-08-27T15:48:50Z)
Load Forecasting on A Highly Sparse Electrical Load Dataset Using Gaussian Interpolation [0.786975267379228]
Sparsity, defined as the presence of missing or zero values in a dataset, often poses a major challenge while operating on real-life datasets. In this study, we show that an approximately 62% sparse dataset with hourly load data of a power plant can be utilized for load forecasting, assuming the data is Wide Sense Stationary (WSS). More specifically, we perform statistical analysis on the data, and train multiple machine learning and deep learning models on the dataset.
arXiv Detail & Related papers (2025-08-12T03:15:45Z)
Core-Set Selection for Data-efficient Land Cover Segmentation [16.89537279044251]
We propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets. This result shows the importance and potential of data-centric learning for the remote sensing domain.
arXiv Detail & Related papers (2025-05-02T12:22:08Z)
Automating Data Science Pipelines with Tensor Completion [4.956678070210018]
We model data science pipelines as instances of tensor completion. The goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values. We extensively evaluate existing and proposed methods in a number of datasets.
arXiv Detail & Related papers (2024-10-08T22:34:08Z)
Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators. Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset. We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets [11.105392318582677]
We propose a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure. We show that in a high-dimensional regime, the EOT plan recovers the shared manifold structure by approximating a kernel function evaluated at the locations of the latent variables.
arXiv Detail & Related papers (2024-07-01T18:48:55Z)
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria. We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
Modified CycleGAN for the synthesization of samples for wheat head segmentation [0.09999629695552192]
In the absence of an annotated dataset, synthetic data can be used for model development. We develop a realistic annotated synthetic dataset for wheat head segmentation. The resulting model achieved a Dice score of 83.4% on an internal dataset and 83.6% on two external Global Wheat Head Detection datasets.
arXiv Detail & Related papers (2024-02-23T06:42:58Z)
Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs). Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs). Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
Joint Distributional Learning via Cramer-Wold Distance [0.7614628596146602]
We introduce the Cramer-Wold distance regularization, which can be computed in a closed-form, to facilitate joint distributional learning for high-dimensional datasets. We also introduce a two-step learning method to enable flexible prior modeling and improve the alignment between the aggregated posterior and the prior distribution.
arXiv Detail & Related papers (2023-10-25T05:24:23Z)
Tackling Computational Heterogeneity in FL: A Few Theoretical Insights [68.8204255655161]
We introduce and analyse a novel aggregation framework that allows for formalizing and tackling computationally heterogeneous data. The proposed aggregation algorithms are extensively analyzed from both a theoretical and an experimental perspective.
arXiv Detail & Related papers (2023-07-12T16:28:21Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets. We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy. Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective. We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
Geometric Dataset Distances via Optimal Transport [15.153110906331733]
We propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing. This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties. Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
arXiv Detail & Related papers (2020-02-07T17:51:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.