Consistent and Flexible Selectivity Estimation for High-Dimensional Data
- URL: http://arxiv.org/abs/2005.09908v4
- Date: Thu, 27 May 2021 15:14:51 GMT
- Title: Consistent and Flexible Selectivity Estimation for High-Dimensional Data
- Authors: Yaoshu Wang, Chuan Xiao, Jianbin Qin, Rui Mao, Makoto Onizuka, Wei
Wang, Rui Zhang, Yoshiharu Ishikawa
- Abstract summary: We propose a new deep learning-based model that learns a query-dependent piecewise linear function as the selectivity estimator.
We show that the proposed model consistently outperforms state-of-the-art models in accuracy while remaining efficient.
- Score: 23.016360687961193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selectivity estimation aims at estimating the number of database objects that
satisfy a selection criterion. Answering this problem accurately and
efficiently is essential to many applications, such as density estimation,
outlier detection, query optimization, and data integration. The estimation
problem is especially challenging for large-scale high-dimensional data due to
the curse of dimensionality, the large variance of selectivity across different
queries, and the need to make the estimator consistent (i.e., the selectivity
is non-decreasing in the threshold). We propose a new deep learning-based model
that learns a query-dependent piecewise linear function as the selectivity
estimator, which is flexible enough to fit the selectivity curve of any
distance function and query object while guaranteeing that the output is
non-decreasing in the threshold. To improve accuracy on large datasets, we
partition the dataset into multiple disjoint subsets and build a local model
on each of them. Experiments on real datasets show that the proposed model
consistently outperforms state-of-the-art models in accuracy while remaining
efficient, and is useful for real applications.
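To make the consistency guarantee concrete, below is a minimal sketch of the core idea in PyTorch. It is not the authors' released implementation; the class name SelectivityPWL, the uniform knot layout, and the MLP sizes are illustrative assumptions. An MLP maps a query embedding to one non-negative increment per threshold segment, and the estimate at threshold t accumulates the increments of the segments t covers, so the output is piecewise linear and non-decreasing in t by construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectivityPWL(nn.Module):
    """Query-dependent piecewise linear selectivity estimator.

    The MLP produces one non-negative increment per threshold segment;
    the estimate at threshold t is the partial cumulative sum of those
    increments, hence non-decreasing in t (the consistency property).
    """

    def __init__(self, query_dim: int, n_knots: int = 32, t_max: float = 1.0):
        super().__init__()
        self.n_knots = n_knots          # number of linear segments
        self.t_max = t_max              # largest supported threshold
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_knots),
        )

    def forward(self, query: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # query: (B, query_dim); t: (B,) thresholds in [0, t_max].
        inc = F.softplus(self.mlp(query))                  # (B, K), all >= 0
        seg_len = self.t_max / self.n_knots
        left_edges = torch.arange(
            self.n_knots, dtype=t.dtype, device=t.device) * seg_len
        # Fraction of each segment covered by t, clamped to [0, 1]:
        # 0 before the segment, 1 after it, linear in between.
        cover = ((t.unsqueeze(1) - left_edges) / seg_len).clamp(0.0, 1.0)
        # Monotone piecewise linear estimate (e.g., of log-selectivity).
        return (inc * cover).sum(dim=1)


# Toy usage: estimates for the same query never decrease as t grows.
model = SelectivityPWL(query_dim=16)
q = torch.randn(4, 16)
t_small, t_large = torch.full((4,), 0.2), torch.full((4,), 0.8)
assert torch.all(model(q, t_small) <= model(q, t_large))
```

Under the paper's partitioning scheme, one such local model would be trained per disjoint subset and the global estimate is the sum of the local estimates; since a sum of non-decreasing functions is itself non-decreasing, consistency is preserved.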
Related papers
- Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making [5.755427480127593]
We show that data values applied for selection can be reformulated as a sequential-decision-making problem.
We propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models.
arXiv Detail & Related papers (2025-02-06T23:03:10Z)
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
The proposed algorithm is shown to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Machine learning with incomplete datasets using multi-objective optimization models [1.933681537640272]
We propose an online approach to handle missing values while a classification model is learnt.
We develop a multi-objective optimization model with two objective functions for imputation and model selection.
We use an evolutionary algorithm based on NSGA-II to find the optimal solutions; a minimal sketch of this two-objective formulation appears after this list.
arXiv Detail & Related papers (2020-12-04T03:44:33Z)
- Joint Adaptive Graph and Structured Sparsity Regularization for Unsupervised Feature Selection [6.41804410246642]
We propose a joint adaptive graph and structured sparsity regularization unsupervised feature selection (JASFS) method.
A subset of optimal features will be selected in group, and the number of selected features will be determined automatically.
Experimental results on eight benchmarks demonstrate the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2020-10-09T08:17:04Z)
- Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach [22.958342743597044]
We investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
We propose a novel and generic method that can be applied to any data type and distance function.
arXiv Detail & Related papers (2020-02-15T20:22:51Z)
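As referenced in the multi-objective entry above, here is a minimal sketch of posing imputation and model quality as two competing objectives solved with NSGA-II via the pymoo library. The decision variables are the imputed values for the missing cells; the two objectives (distance of imputations to column means, and the training error of a nearest-centroid classifier on the completed data) are illustrative stand-ins, not that paper's exact formulations.

```python
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

# Toy dataset with ~10% of entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan
col_mean = np.nanmean(X, axis=0)
miss_cols = np.where(mask)[1]          # column index of each missing cell


class ImputeProblem(ElementwiseProblem):
    """Decision variables = imputed values for the missing cells."""

    def __init__(self):
        super().__init__(n_var=int(mask.sum()), n_obj=2, xl=-3.0, xu=3.0)

    def _evaluate(self, v, out, *args, **kwargs):
        Xc = X.copy()
        Xc[mask] = v                   # fill in the candidate imputations
        # Objective 1: keep imputed values close to their column means
        # (a crude plausibility proxy for the imputation objective).
        f1 = np.sum((v - col_mean[miss_cols]) ** 2)
        # Objective 2: training error of a nearest-centroid classifier
        # on the completed data (a crude model-quality proxy).
        c0, c1 = Xc[y == 0].mean(0), Xc[y == 1].mean(0)
        pred = (np.linalg.norm(Xc - c1, axis=1)
                < np.linalg.norm(Xc - c0, axis=1)).astype(int)
        f2 = np.mean(pred != y)
        out["F"] = [f1, f2]


res = minimize(ImputeProblem(), NSGA2(pop_size=40), ("n_gen", 60), seed=1)
print(res.F[:5])   # a few points on the imputation/accuracy Pareto front
```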
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.