DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation
- URL: http://arxiv.org/abs/2503.08994v3
- Date: Tue, 09 Sep 2025 14:00:23 GMT
- Title: DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation
- Authors: Kaixin Zhang, Hongzhi Wang, Ziqi Li, Yabin Lu, Yingze Li, Yu Yan, Yiming Guan,
- Abstract summary: We introduce DistJoin, a join cardinality estimator based on efficient distribution prediction using multi-autoregressive models. We demonstrate that DistJoin achieves the highest accuracy, robustness to data updates, and generality, with comparable update and inference speed relative to existing methods.
- Score: 11.506172695239576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on learned cardinality estimation has made significant progress in recent years. However, existing methods still face distinct challenges that hinder their practical deployment in production environments. We define these challenges as the "Trilemma of Cardinality Estimation", where learned cardinality estimation methods struggle to balance generality, accuracy, and updatability. To address these challenges, we introduce DistJoin, a join cardinality estimator based on efficient distribution prediction using multi-autoregressive models. Our contributions are threefold: (1) We propose a method to estimate join cardinality by leveraging the probability distributions of individual tables in a decoupled manner. (2) To meet the requirements of efficiency for DistJoin, we develop Adaptive Neural Predicate Modulation (ANPM), a high-throughput distribution estimation model. (3) We demonstrate that an existing similar approach suffers from variance accumulation issues by formal variance analysis. To mitigate this problem, DistJoin employs a selectivity-based approach to infer join cardinality, effectively reducing variance. In summary, DistJoin not only represents the first data-driven method to support both equi-joins and non-equi joins simultaneously but also demonstrates superior accuracy while enabling fast and flexible updates. The experimental results demonstrate that DistJoin achieves the highest accuracy, robustness to data updates, generality, and comparable update and inference speed relative to existing methods.
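The decoupled, selectivity-based idea in the abstract can be illustrated with a toy calculation. The sketch below is not DistJoin itself (which predicts distributions with neural ANPM models rather than exact counts); it only shows the underlying identity that an equi-join's cardinality is recoverable from the two tables' per-value distributions alone, without touching the joined result. All names are illustrative.

```python
from collections import Counter

def equi_join_cardinality(col_r, col_s):
    """Estimate |R JOIN S ON R.a = S.a| using only per-table
    value distributions. With exact counts this is an identity:
    |R JOIN S| = |R| * |S| * sum_v P_R(v) * P_S(v)."""
    n_r, n_s = len(col_r), len(col_s)
    cnt_r, cnt_s = Counter(col_r), Counter(col_s)
    # Selectivity of the predicate R.a = S.a under independent
    # draws of one row from R and one row from S.
    sel = sum((cnt_r[v] / n_r) * (cnt_s[v] / n_s)
              for v in cnt_r if v in cnt_s)
    return n_r * n_s * sel

r = [1, 1, 2, 3]  # join column of R
s = [1, 2, 2, 4]  # join column of S
est = equi_join_cardinality(r, s)  # 4.0: pairs (1,1) x2 and (2,2) x2
```

A learned estimator replaces the exact `Counter` distributions with model-predicted ones, which is what makes the per-table factors cheap to maintain under data updates.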
Related papers
- DANCE: Doubly Adaptive Neighborhood Conformal Estimation [12.643121779828526]
We propose a doubly locally adaptive nearest-neighbor based conformal algorithm combining two novel nonconformity scores directly using the data's embedded representation. We test against state-of-the-art local, task-adapted and zero-shot conformal baselines, demonstrating DANCE's superior blend of set size efficiency and robustness across various datasets.
arXiv Detail & Related papers (2026-02-24T07:54:53Z) - CoLSE: A Lightweight and Robust Hybrid Learned Model for Single-Table Cardinality Estimation using Joint CDF [7.945011337356916]
Cardinality estimation is a critical component of query optimization. We propose CoLSE, a hybrid learned approach for single-table cardinality estimation. Experimental results show that CoLSE achieves a favorable trade-off among accuracy, training time, latency, and model size, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2025-12-14T10:08:20Z) - Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models [64.92045568376705]
Coherent Contextual Decoding (CCD) is a novel inference framework built upon two core innovations. CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step.
arXiv Detail & Related papers (2025-11-26T09:49:48Z) - A Lightweight Learned Cardinality Estimation Model [12.33987311866824]
We propose a novel data-driven approach called CoDe (Covering with Decompositions) to address the cardinality estimation problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. By employing multiple models to approximate distributions, CoDe excels in effectively modeling discrete distributions and ensuring computational efficiency.
arXiv Detail & Related papers (2025-08-13T08:34:58Z) - NAN: A Training-Free Solution to Coefficient Estimation in Model Merging [61.36020737229637]
We show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. We propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies.
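The "coefficients via the inverse of parameter norm" idea is concrete enough to sketch. The abstract does not spell out NAN's exact formula, so the snippet below is only a plausible reading: weight each task-specific parameter vector by the inverse of its L2 norm, normalized so the coefficients form a convex combination. Function and variable names are illustrative.

```python
import numpy as np

def inverse_norm_merge(params):
    """Training-free merging: coefficient for each task vector is
    proportional to 1 / ||theta_i||_2, normalized to sum to 1."""
    norms = np.array([np.linalg.norm(p) for p in params])
    inv = 1.0 / norms
    weights = inv / inv.sum()
    merged = sum(w * p for w, p in zip(weights, params))
    return merged, weights

theta_a = np.array([3.0, 4.0])   # L2 norm 5
theta_b = np.array([0.0, 5.0])   # L2 norm 5
merged, weights = inverse_norm_merge([theta_a, theta_b])
# equal norms -> equal weights [0.5, 0.5]; merged = [1.5, 4.5]
```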
arXiv Detail & Related papers (2025-05-22T02:46:08Z) - Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection [52.716143424856185]
We propose LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection.
LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors.
Our method also outperforms the greedy search in attribution efficiency, being 1.6 times faster.
arXiv Detail & Related papers (2025-04-01T06:58:15Z) - Preference Construction: A Bayesian Interactive Preference Elicitation Framework Based on Monte Carlo Tree Search [6.473114631834851]
We present a novel preference learning framework to capture participant preferences efficiently within limited interaction rounds.
First, we develop a variational Bayesian approach to infer the participant's preference model.
Second, we propose an adaptive questioning policy that maximizes cumulative uncertainty reduction.
Third, we apply the framework to Multiple Criteria Decision Aiding, with pairwise comparison as the preference information.
arXiv Detail & Related papers (2025-03-19T12:16:54Z) - EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations [12.962019992859531]
We introduce quasi-Karp equivalence, a formal criterion for determining when two optimization formulations are equivalent. We propose EquivaMap, a framework that leverages large language models to automatically discover such mappings.
arXiv Detail & Related papers (2025-02-20T17:35:32Z) - Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results [1.13107643869251]
We introduce a new approach based on fine-grained correspondence verification with specific prompts for a Large Language Model. Our approach is an iterative loop that consists of three main components: (1) the correspondence selection algorithm, (2) correspondence verification, and (3) the update of the probability distribution. We propose a novel $(1-1/e)$-approximation algorithm that significantly outperforms the brute-force algorithm in terms of computational efficiency.
arXiv Detail & Related papers (2024-08-24T16:54:08Z) - Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits approach (LORT) without the requirement of prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z) - Duet: efficient and scalable hybriD neUral rElation undersTanding [9.231883521214241]
Duet is a stable, efficient, and scalable hybrid method to estimate cardinality directly without sampling or any non-differentiable process.
Duet has a lower inference cost on CPU than that of most learned methods on GPU.
arXiv Detail & Related papers (2023-07-25T13:42:22Z) - Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise.
In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z) - Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators [8.641606056228675]
Machine learning-based cardinality estimation methods are replacing traditional methods.
We propose Scardina, a new join cardinality estimation method using multiple partitioned models based on the schema structure.
arXiv Detail & Related papers (2023-03-31T13:22:28Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - Control as Hybrid Inference [62.997667081978825]
We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference.
We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines.
arXiv Detail & Related papers (2020-07-11T19:44:09Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
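GenDICE's core move, per the abstract, is "estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions." The toy below shows only the correction step with a ratio that is already known; GenDICE's actual contribution is learning that ratio from data, which this sketch does not attempt. All distributions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples come from an empirical distribution p over {0, 1}, but we
# want an expectation under a different target distribution q (for
# GenDICE, the stationary distribution). Reweighting by the ratio
# w(x) = q(x) / p(x) corrects the mismatch: E_q[f(x)] = E_p[w(x) f(x)].
p = np.array([0.8, 0.2])                # empirical distribution
q = np.array([0.5, 0.5])                # target distribution
xs = rng.choice(2, size=100_000, p=p)   # data sampled from p
w = q[xs] / p[xs]                       # known ratio (GenDICE learns it)
est = np.mean(w * xs)                   # corrected estimate of E_q[x]
# est is close to E_q[x] = 0.5; the naive mean(xs) is close to
# E_p[x] = 0.2, the wrong expectation.
```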
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.