Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning
- URL: http://arxiv.org/abs/2601.01023v1
- Date: Sat, 03 Jan 2026 01:15:27 GMT
- Title: Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning
- Authors: João Morais, Sadjad Alikhani, Akshay Malhotra, Shahab Hamidi-Rad, Ahmed Alkhateeb
- Abstract summary: This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets. It enables applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, and informing decisions on model training/adaptation to new deployments.
- Score: 15.036550722400085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets, enabling applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and informing decisions on model training/adaptation to new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We apply the framework on an unsupervised task, channel state information (CSI) compression, using autoencoders. Using metrics based on UMAP embeddings, combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We also apply the framework to a supervised downlink beam prediction task using convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.
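The evaluation criterion above (a good dataset distance should correlate with train-on-one/test-on-another performance) can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the UMAP embedding step is replaced by random 1-D projections (a sliced Wasserstein distance), the datasets are synthetic Gaussians standing in for CSI data, and the transfer error rates are placeholder numbers, not results from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, wasserstein_distance


def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Average 1-D Wasserstein distance over random 1-D projections.

    A simple stand-in for the paper's UMAP-embedding-based distances.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)  # unit-norm projection direction
        total += wasserstein_distance(X @ v, Y @ v)
    return total / n_proj


# Toy stand-ins for CSI datasets from different deployments: copies of a
# base dataset with growing distribution shift.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 32))
shifted = [base + s for s in (0.5, 1.0, 2.0)]

# Distance from the base dataset to each shifted one.
dists = [sliced_wasserstein(base, D) for D in shifted]

# Hypothetical train-on-base/test-on-shifted error rates (placeholders),
# worsening with the shift, to illustrate the correlation step.
transfer_err = [0.05, 0.12, 0.30]
r, _ = pearsonr(dists, transfer_err)
print(dists, round(r, 3))
```

In this synthetic setup the distance grows linearly with the shift, so the Pearson correlation with any monotonically worsening transfer error is high; the paper's contribution is showing that suitably chosen embeddings and distances achieve such correlations on real wireless datasets.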
Related papers
- SITUATE -- Synthetic Object Counting Dataset for VLM training [0.0]
We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA.
arXiv Detail & Related papers (2026-01-26T16:17:53Z) - Cross-Learning from Scarce Data via Multi-Task Constrained Optimization [70.90607489166648]
This paper introduces a multi-task cross-learning framework to overcome data scarcity. We formulate this joint estimation as a constrained optimization problem. We show the efficiency of our cross-learning method in applications with real data, including image classification and propagation of infectious diseases.
arXiv Detail & Related papers (2025-11-17T18:35:59Z) - Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z) - What is the Right Notion of Distance between Predict-then-Optimize Tasks? [35.842182348661076]
OTD$^3$ is a novel dataset distance that incorporates downstream decisions in addition to features and labels. We show that our proposed distance accurately predicts model transferability across three different Predict-then-Optimize tasks.
arXiv Detail & Related papers (2024-09-11T04:13:17Z) - Cross-domain Learning Framework for Tracking Users in RIS-aided Multi-band ISAC Systems with Sparse Labeled Data [55.70071704247794]
Integrated sensing and communications (ISAC) is pivotal for 6G communications and is boosted by the rapid development of reconfigurable intelligent surfaces (RISs).
This paper proposes the X2Track framework, where we model the tracking function by a hierarchical architecture, jointly utilizing multi-modal CSI indicators across multiple bands, and optimize it in a cross-domain manner.
Under X2Track, we design an efficient deep learning algorithm to minimize tracking errors, based on transformer neural networks and adversarial learning techniques.
arXiv Detail & Related papers (2024-05-10T08:04:27Z) - Improving Transferability for Cross-domain Trajectory Prediction via Neural Stochastic Differential Equation [41.09061877498741]
Discrepancies exist among datasets due to external factors and data acquisition strategies.
The proficient performance of models trained on large-scale datasets has limited transferability to other small-size datasets.
We propose a method based on Neural Stochastic Differential Equations (NSDE) to alleviate these discrepancies.
The effectiveness of our method is validated against state-of-the-art trajectory prediction models on the popular benchmark datasets: nuScenes, Argoverse, Lyft, INTERACTION, and Open Motion dataset.
arXiv Detail & Related papers (2023-12-26T06:50:29Z) - Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z) - Collaborative Learning with a Drone Orchestrator [79.75113006257872]
A swarm of intelligent wireless devices train a shared neural network model with the help of a drone.
The proposed framework achieves a significant speedup in training, with average savings of 24% and 87% in drone hovering time.
arXiv Detail & Related papers (2023-03-03T23:46:25Z) - Wasserstein Task Embedding for Measuring Task Similarities [14.095478018850374]
Measuring similarities between different tasks is critical in a broad spectrum of machine learning problems.
We leverage the optimal transport theory and define a novel task embedding for supervised classification.
We show that the proposed embedding leads to a significantly faster comparison of tasks compared to related approaches.
arXiv Detail & Related papers (2022-08-24T18:11:04Z) - Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a state similar to that of networks trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
arXiv Detail & Related papers (2022-03-22T17:58:59Z) - Geometric Dataset Distances via Optimal Transport [15.153110906331733]
We propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing.
This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties.
Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer learning hardness across various experimental settings and datasets.
arXiv Detail & Related papers (2020-02-07T17:51:26Z)
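The optimal-transport distance in the last entry above compares datasets as point clouds. Below is a minimal sketch of the core computation, the exact 2-Wasserstein distance between two equal-size empirical distributions via the optimal-assignment formulation. It is not the paper's implementation: the paper's full distance additionally folds label information into the ground cost, which this sketch omits.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def exact_w2(X, Y):
    """Exact 2-Wasserstein distance between two equal-size empirical
    point clouds, via the optimal one-to-one matching (Monge) formulation.
    """
    assert len(X) == len(Y), "equal sample sizes required for matching"
    C = cdist(X, Y, metric="sqeuclidean")  # pairwise squared costs
    rows, cols = linear_sum_assignment(C)  # minimum-cost matching
    return float(np.sqrt(C[rows, cols].mean()))


rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
B = A + 1.0  # translate every feature by 1: W2 should equal sqrt(8)

print(exact_w2(A, A))  # 0.0
print(exact_w2(A, B))
```

For uniform empirical measures of equal size, the optimal transport plan is always achieved by a permutation, so the assignment solver gives the exact distance. The Hungarian matching costs O(n^3), so this form is practical only for modest sample sizes; large datasets typically use entropic (Sinkhorn) approximations instead.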
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.