Optimize Cardinality Estimation Model Pretraining by Simplifying the Training Datasets
- URL: http://arxiv.org/abs/2502.14350v1
- Date: Thu, 20 Feb 2025 08:06:16 GMT
- Title: Optimize Cardinality Estimation Model Pretraining by Simplifying the Training Datasets
- Authors: Boyang Fang,
- Abstract summary: We introduce a simplified training dataset, which has been reduced to a fraction of the size of existing pretraining datasets.
Extensive experiments demonstrate that a cardinality estimator pre-trained on this simplified dataset still achieves performance comparable to existing models in zero-shot setups.
- Score: 0.0
- Abstract: Cardinality estimation is a key aspect of query optimization research, and its performance has improved significantly with the integration of machine learning. To overcome the "cold start" problem or the lack of model transferability in learned cardinality estimators, several pre-trained cardinality estimation models have been proposed that learn across multiple datasets and their corresponding workloads. These models typically train on a dataset created by uniformly sampling from many datasets, but this approach may not be optimal. By applying the Group Distributionally Robust Optimization (Group DRO) algorithm to the training datasets, we find that some specific training datasets contribute more significantly to model performance than others. Based on this observation, we conduct extensive experiments to delve deeper into pre-training cardinality estimators. Our results show how the performance of these models is influenced by the datasets and their corresponding workloads. Finally, we introduce a simplified training dataset, which has been reduced to a fraction of the size of existing pretraining datasets. Extensive experiments demonstrate that a cardinality estimator pre-trained on this simplified dataset still achieves performance comparable to existing models in zero-shot setups.
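The abstract does not describe how Group DRO is applied, so the following is only a minimal sketch of the commonly described online Group DRO weight update, under the assumption that each training dataset is treated as one group. The function name `group_dro_step`, the step size `eta`, and the per-dataset loss values are hypothetical and are not taken from the paper.

```python
import numpy as np

def group_dro_step(q, group_losses, eta=0.1):
    """One exponentiated-gradient update of the group weights.

    q            : current weights over groups (here: training datasets), sums to 1
    group_losses : current average loss of the model on each group
    eta          : step size for the weight update (hypothetical value)
    Returns the new weights and the weighted loss the model would minimize.
    """
    q = np.asarray(q, dtype=float)
    losses = np.asarray(group_losses, dtype=float)
    # Groups with higher loss get exponentially more weight.
    new_q = q * np.exp(eta * losses)
    new_q /= new_q.sum()
    # The model update would then use this reweighted objective.
    robust_loss = float(np.dot(new_q, losses))
    return new_q, robust_loss

# Illustrative only: three datasets with fixed, made-up losses.
q = np.full(3, 1.0 / 3.0)
losses = [0.2, 1.5, 0.4]
for _ in range(50):
    q, robust_loss = group_dro_step(q, losses)
```

In this sketch the weights drift toward the dataset with the highest loss, which is one way to read off which datasets the objective emphasizes. In real training the losses change after every model update, so the weights would not simply converge to a single dataset.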
Related papers
- Scalable Data Ablation Approximations for Language Models through Modular Training and Merging [27.445079398772904]
We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus.
We find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data.
arXiv Detail & Related papers (2024-10-21T06:03:49Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Optimizing V-information for Self-Supervised Pre-training Data-Effective Medical Foundation Models [15.413974936297082]
Self-supervised pre-training medical foundation models on large-scale datasets demonstrate exceptional performance.
Data-effective learning approaches have been introduced to select valuable samples for foundation model pre-training.
We develop an optimal data-effective learning method to optimize V-information in real-world medical domains.
arXiv Detail & Related papers (2024-08-13T10:28:54Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z)
- Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to a model's performance?
How can one construct the smallest subset of the entire training data to serve as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
arXiv Detail & Related papers (2022-01-31T19:19:30Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch achieves final performance that is no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.