On Aggregation Queries over Predicted Nearest Neighbors
- URL: http://arxiv.org/abs/2502.18803v1
- Date: Wed, 26 Feb 2025 04:17:32 GMT
- Title: On Aggregation Queries over Predicted Nearest Neighbors
- Authors: Carrie Wang, Sihem Amer-Yahia, Laks V. S. Lakshmanan, Reynold Cheng,
- Abstract summary: We introduce AQNNs, a novel type of aggregation queries over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute "the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient". Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate.
- Score: 33.06696811081107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Aggregation Queries over Nearest Neighbors (AQNNs), a novel type of aggregation queries over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute "the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient". Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate by combining an expensive oracle and a cheaper model (e.g., a simple ML model) to compute the predictions. We design the Sampler with Precision-Recall in Target (SPRinT) framework for answering AQNNs. SPRinT consists of sampling, nearest neighbor refinement, and aggregation, and is tailored for various aggregation functions. It enjoys provable theoretical guarantees, including bounds on sample size and on error in approximate aggregates. Our extensive experiments on medical, e-commerce, and video datasets demonstrate that SPRinT consistently achieves the lowest aggregation error with minimal computation cost compared to its baselines. Scalability results show that SPRinT's execution time and aggregation error remain stable as the dataset size increases, confirming its suitability for large-scale applications.
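To make the three stages concrete, below is a minimal Python sketch of how a sampling / neighbor-refinement / aggregation pipeline of this kind could be wired together. It is an illustration only: the function names (`cheap_model`, `oracle`), the oracle-budget heuristic, and the `systolic_bp` attribute are assumptions for the example, not the paper's actual SPRinT algorithm or its precision-recall targets.

```python
import random
from statistics import mean

def approximate_avg_over_neighbors(query, dataset, cheap_model, oracle,
                                   sample_size, oracle_budget, radius):
    """Toy sketch of a sample-refine-aggregate pipeline (not the paper's algorithm).

    `cheap_model(q, x)` and `oracle(q, x)` are assumed to return a dissimilarity
    between the query and a candidate record; records are dicts carrying the
    attribute to be aggregated.
    """
    # 1. Sampling: work on a uniform sample instead of the full dataset.
    sample = random.sample(dataset, min(sample_size, len(dataset)))

    # 2. Nearest-neighbor refinement: rank candidates with the cheap model,
    #    then spend the limited oracle budget on the top-ranked candidates
    #    to correct the cheap predictions where they matter most.
    ranked = sorted(sample, key=lambda x: cheap_model(query, x))
    neighbors = []
    for i, record in enumerate(ranked):
        dist = oracle(query, record) if i < oracle_budget else cheap_model(query, record)
        if dist <= radius:
            neighbors.append(record)

    # 3. Aggregation: apply the aggregation function (AVG here) to the
    #    refined predicted neighborhood.
    return mean(r["systolic_bp"] for r in neighbors) if neighbors else None
```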
Related papers
- Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [45.76310830281876]
We propose Quantile Sub-Ensembles, a novel method to estimate uncertainty with an ensemble of quantile-regression-based task networks.
Our method not only produces accurate imputations that are robust to high missing rates, but is also computationally efficient thanks to the fast training of its non-generative model.
arXiv Detail & Related papers (2023-12-03T05:52:30Z)
- Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important when forecasting nonstationary processes or complex mixtures of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems.
It is proved that this structured model can efficiently interpolate the underlying tessellation and approximate the multiple-hypotheses target distribution.
arXiv Detail & Related papers (2023-09-02T01:27:53Z)
- ALMERIA: Boosting pairwise molecular contrasts with scalable methods [0.0]
ALMERIA is a tool for estimating compound similarities and activity prediction based on pairwise molecular contrasts.
It has been implemented using scalable software and methods to exploit large volumes of data.
Experiments show state-of-the-art performance for molecular activity prediction.
arXiv Detail & Related papers (2023-04-28T16:27:06Z)
- Adapting Neural Link Predictors for Data-Efficient Complex Query Answering [45.961111441411084]
We propose a parameter-efficient score adaptation model optimised to re-calibrate neural link prediction scores for the complex query answering task.
CQD$^{\mathcal{A}}$ produces significantly more accurate results than current state-of-the-art methods.
arXiv Detail & Related papers (2023-01-29T00:17:16Z)
- Predictive Querying for Autoregressive Neural Sequence Models [23.85426261235507]
We introduce a general typology for predictive queries in neural autoregressive sequence models.
We show that such queries can be systematically represented by sets of elementary building blocks.
We leverage this typology to develop new query estimation methods.
arXiv Detail & Related papers (2022-10-12T17:59:36Z)
- Contrastive Neural Ratio Estimation for Simulation-based Inference [15.354874711988662]
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task.
In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term.
We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on.
arXiv Detail & Related papers (2022-10-11T00:12:51Z)
- Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as a subroutine.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z)
- Electra: Conditional Generative Model based Predicate-Aware Query Approximation [10.056919500568013]
ELECTRA is a predicate-aware AQP system that can answer analytics-style queries containing a large number of predicates with much smaller approximation errors.
Our evaluations with four different baselines on three real-world datasets show that ELECTRA provides lower AQP error for a large number of predicates than the baselines.
arXiv Detail & Related papers (2022-01-28T21:13:26Z)
- Predict then Interpolate: A Simple Algorithm to Learn Stable Classifiers [59.06169363181417]
Predict then Interpolate (PI) is an algorithm for learning correlations that are stable across environments.
We prove that by interpolating the distributions of the correct predictions and the wrong predictions, we can uncover an oracle distribution where the unstable correlation vanishes.
arXiv Detail & Related papers (2021-05-26T15:37:48Z)
- Leverage Score Sampling for Complete Mode Coverage in Generative Adversarial Networks [11.595070613477548]
A generative model may overlook underrepresented modes that are less frequent in the empirical data distribution.
We propose a sampling procedure based on ridge leverage scores which significantly improves mode coverage when compared to standard methods.
arXiv Detail & Related papers (2021-04-06T09:00:38Z)
- Flexible Model Aggregation for Quantile Regression [92.63075261170302]
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions.
We investigate methods for aggregating any number of conditional quantile models.
All of the models we consider in this paper can be fit using modern deep learning toolkits.
arXiv Detail & Related papers (2021-02-26T23:21:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.