Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach
- URL: http://arxiv.org/abs/2502.17060v2
- Date: Thu, 24 Apr 2025 08:36:47 GMT
- Title: Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach
- Authors: Andreas Loizou, Dimitrios Tsoumakos
- Abstract summary: We propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Our model can project different real-world scenarios into a lower-dimensional vector embedding representation and distinguish between them.
- Score: 0.3683202928838613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The massive increase in data volume and dataset availability compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting the highest-quality data greatly improves task accuracy and efficiency, it remains a hard task, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Dataset similarity is assessed by projecting each dataset into a vector embedding representation. The vectorization is performed by our proposed deep learning model NumTabData2Vec, which takes a whole dataset and projects it into a lower-dimensional vector embedding space. Through experimental evaluation, we compare the prediction performance and execution time of our framework to another state-of-the-art modelling operator framework, showing that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios into a lower-dimensional vector embedding space and distinguish between them.
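The pipeline the abstract describes (embed each dataset, retrieve the most similar previously analysed datasets, predict the operator's outcome from their recorded results) can be summarized in a short sketch. NumTabData2Vec itself is not reproduced here: `embed_dataset` below is a hypothetical stand-in based on per-column summary statistics, and the similarity-weighted predictor and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def embed_dataset(values: np.ndarray, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for NumTabData2Vec: fixed-length vector of
    per-column summary statistics."""
    stats = np.concatenate([
        values.mean(axis=0), values.std(axis=0),
        np.percentile(values, 25, axis=0), np.percentile(values, 75, axis=0),
    ])
    out = np.zeros(dim)                            # fixed size across datasets
    out[:min(dim, stats.size)] = stats[:dim]
    return out

def predict_operator_outcome(query, corpus, outcomes, k=3):
    """Predict an operator's outcome on `query` as the similarity-weighted
    average of the k most similar datasets' recorded outcomes."""
    q = embed_dataset(query)
    embs = np.stack([embed_dataset(d) for d in corpus])
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(sims)[-k:]                    # k nearest by cosine similarity
    w = np.clip(sims[top], 0.0, None) + 1e-12
    return float(np.dot(w, np.asarray(outcomes)[top]) / w.sum())

# Toy corpus of previously analysed datasets with recorded operator outcomes.
rng = np.random.default_rng(0)
corpus = [rng.normal(loc=i, size=(100, 4)) for i in range(5)]
outcomes = [0.1 * i for i in range(5)]             # e.g., recorded operator scores
print(predict_operator_outcome(rng.normal(loc=2, size=(100, 4)), corpus, outcomes))
```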
Related papers
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z)
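CSI is described as FID-inspired; for readers unfamiliar with that family of metrics, the sketch below computes the standard Fréchet distance between two sets of feature embeddings under a Gaussian assumption. This is the generic FID formula, not the CSI implementation from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):        # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 16)), rng.normal(0.5, size=(500, 16))))
```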
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy samples, which negatively impact training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- RepMatch: Quantifying Cross-Instance Similarities in Representation Space [15.215985417763472]
We introduce RepMatch, a novel method that characterizes data through the lens of similarity.
RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them.
We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models.
arXiv Detail & Related papers (2024-10-12T20:42:28Z)
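RepMatch compares the knowledge encoded in models trained on different subsets. The paper's exact procedure is not reproduced here; the sketch below shows one common proxy for such comparisons, linear CKA between two models' representations on a shared probe set, with purely toy data.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representations X (n, d1) and Y (n, d2)
    extracted from two models on the same n probe inputs."""
    X = X - X.mean(axis=0)                         # center features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 32))                  # probe representations, model A
similar = base + 0.1 * (base @ rng.normal(size=(32, 32)))   # model B close to A
unrelated = rng.normal(size=(200, 32))             # model C trained on unlike data
print(linear_cka(base, similar), linear_cka(base, unrelated))
```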
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
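A minimal sketch of the low-rank gradient-similarity idea behind LESS: per-example training gradients are randomly projected to a low dimension and scored by cosine similarity against the target task's average gradient, keeping the top 5%. Logistic regression stands in for the instruction-tuned LLM; every name and threshold here is an illustrative assumption, not the LESS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, d, proj_dim = 1000, 50, 64, 16

w = 0.1 * rng.normal(size=d)                       # "current" model parameters
X_tr, y_tr = rng.normal(size=(n_train, d)), rng.integers(0, 2, n_train)
X_val, y_val = rng.normal(size=(n_val, d)), rng.integers(0, 2, n_val)

def per_example_grads(X, y, w):
    """Gradient of the logistic loss wrt w, one row per example."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X                    # shape (n, d)

P = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)   # random low-rank projection
g_tr = per_example_grads(X_tr, y_tr, w) @ P
g_val = per_example_grads(X_val, y_val, w).mean(axis=0) @ P   # target-task gradient

sims = g_tr @ g_val / (np.linalg.norm(g_tr, axis=1) * np.linalg.norm(g_val) + 1e-12)
selected = np.argsort(sims)[-n_train // 20:]       # keep the top 5% by similarity
print(selected.shape, sims[selected].min())
```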
- Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources [9.359395812292291]
This paper proposes a framework called projektor, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources.
projektor significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor.
Also, projektor outperforms a range of other off-the-shelf solutions in data selection effectiveness by a wide margin.
arXiv Detail & Related papers (2023-07-05T17:33:41Z)
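A minimal sketch of scoring a partially revealed source with optimal transport, using the POT library (`pip install pot`): compute the OT distance between the revealed pilot sample and the data the model will face. The mapping from OT distance to predicted performance is the paper's actual contribution and is only a crude placeholder below.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
pilot = rng.normal(0.0, 1.0, size=(100, 8))        # partial sample from a source
target = rng.normal(0.3, 1.0, size=(200, 8))       # data the model will face

a = np.full(len(pilot), 1.0 / len(pilot))          # uniform sample weights
b = np.full(len(target), 1.0 / len(target))
M = ot.dist(pilot, target)                         # squared-Euclidean cost matrix
ot_dist = ot.emd2(a, b, M)                         # exact OT cost

# Illustrative (not the paper's) mapping: closer sources -> higher predicted score.
predicted_acc = 1.0 / (1.0 + ot_dist)
print(ot_dist, predicted_acc)
```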
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- A Case for Dataset Specific Profiling [0.9023847175654603]
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
arXiv Detail & Related papers (2022-08-01T18:38:05Z)
- MEAT: Maneuver Extraction from Agent Trajectories [9.919575841909962]
We propose an automated methodology that extracts maneuvers from agent trajectories in large-scale datasets.
Although the resulting maneuvers could be used to train classification networks, we use them, as an example application, for extensive trajectory dataset analysis.
arXiv Detail & Related papers (2022-06-10T14:56:32Z)
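A minimal sketch of rule-based maneuver extraction from 2-D agent trajectories: label each step from its heading change. The thresholds and label set are illustrative assumptions, not the maneuver definitions used by MEAT.

```python
import numpy as np

def extract_maneuvers(xy: np.ndarray, dt: float = 0.1,
                      turn_thresh: float = 0.05) -> list[str]:
    """xy: (T, 2) positions sampled every dt seconds -> per-step labels."""
    v = np.diff(xy, axis=0) / dt                   # velocities, shape (T-1, 2)
    heading = np.arctan2(v[:, 1], v[:, 0])
    dpsi = np.diff(heading)                        # heading change per step
    dpsi = np.arctan2(np.sin(dpsi), np.cos(dpsi))  # wrap to (-pi, pi]
    labels = []
    for d in dpsi:
        if d > turn_thresh:
            labels.append("turn_left")
        elif d < -turn_thresh:
            labels.append("turn_right")
        else:
            labels.append("straight")
    return labels

t = np.linspace(0, 4 * np.pi, 200)
xy = np.stack([np.cos(t), np.sin(t)], axis=1)      # counter-clockwise circle
print(extract_maneuvers(xy)[:5])                   # mostly "turn_left"
```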
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
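A minimal sketch of the Item Response Theory machinery behind this comparison: fit a Rasch (1PL) model, P(model i answers item j correctly) = sigmoid(a_i - d_j), to a binary response matrix by gradient ascent. The model count echoes the paper's 18 Transformers, but all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 18, 300
true_a = rng.normal(size=n_models)                 # latent model "ability"
true_d = rng.normal(size=n_items)                  # latent item "difficulty"
R = (rng.random((n_models, n_items)) <
     1 / (1 + np.exp(-(true_a[:, None] - true_d[None, :])))).astype(float)

a = np.zeros(n_models)
d = np.zeros(n_items)
lr = 0.05
for _ in range(500):                               # maximize Bernoulli log-likelihood
    p = 1 / (1 + np.exp(-(a[:, None] - d[None, :])))
    err = R - p
    a += lr * err.mean(axis=1)                     # ascent on ability parameters
    d -= lr * err.mean(axis=0)                     # ascent on difficulty parameters

# Recovered parameters should correlate strongly with the generating ones.
print(np.corrcoef(a, true_a)[0, 1], np.corrcoef(d, true_d)[0, 1])
```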
- Quantifying the Complexity of Standard Benchmarking Datasets for Long-Term Human Trajectory Prediction [8.870188183999852]
We propose an approach for quantifying the amount of information contained in a dataset from a prototype-based dataset representation.
A large-scale complexity analysis is conducted on several human trajectory prediction benchmarking datasets.
arXiv Detail & Related papers (2020-05-28T12:00:41Z)
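A minimal sketch of a prototype-based complexity proxy in the spirit of the entry above: the fewer prototypes (k-means centers) needed to reconstruct every trajectory within a tolerance, the less information the dataset carries. Uses scikit-learn; the tolerance and the reconstruction criterion are illustrative assumptions, not the paper's measure.

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_complexity(trajs: np.ndarray, tol: float = 0.5, k_max: int = 50) -> int:
    """trajs: (N, T*2) flattened trajectories -> smallest k whose prototypes
    reconstruct the trajectories with mean error below tol."""
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(trajs)
        protos = km.cluster_centers_[km.labels_]   # each trajectory's prototype
        if np.linalg.norm(trajs - protos, axis=1).mean() < tol:
            return k
    return k_max

rng = np.random.default_rng(0)
simple = np.repeat(rng.normal(size=(3, 40)), 50, axis=0) + 0.01 * rng.normal(size=(150, 40))
varied = rng.normal(size=(150, 40))
print(prototype_complexity(simple), prototype_complexity(varied))  # low vs. high
```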
This list is automatically generated from the titles and abstracts of the papers in this site.