Redundancy-Driven Top-$k$ Functional Dependency Discovery
- URL: http://arxiv.org/abs/2601.10130v1
- Date: Thu, 15 Jan 2026 07:16:01 GMT
- Title: Redundancy-Driven Top-$k$ Functional Dependency Discovery
- Authors: Xiaolong Wan, Xixian Han
- Abstract summary: Functional dependencies (FDs) are basic constraints in relational databases and are used for many data management tasks. Most FD discovery algorithms find all valid dependencies, but this causes two problems. We propose SDP (Selective-Discovery-and-Prune), which discovers the top-$k$ FDs ranked by redundancy count.
- Score: 5.126404609788543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Functional dependencies (FDs) are basic constraints in relational databases and are used for many data management tasks. Most FD discovery algorithms find all valid dependencies, but this causes two problems. First, the computational cost is prohibitive: computational complexity grows quadratically with the number of tuples and exponentially with the number of attributes, making discovery slow on large-scale and high-dimensional data. Second, the result set can be huge, making it hard to identify useful dependencies. We propose SDP (Selective-Discovery-and-Prune), which discovers the top-$k$ FDs ranked by redundancy count. Redundancy count measures how much duplicated information an FD explains and connects directly to storage overhead and update anomalies. SDP uses an upper bound on redundancy to prune the search space. It is proved that this upper bound is monotone: adding attributes refines partitions and thus decreases the bound. Once the bound falls below the top-$k$ threshold, the entire branch can be skipped. We improve SDP with three optimizations: ordering attributes by partition cardinality, using pairwise statistics in a Partition Cardinality Matrix to tighten bounds, and a global scheduler to explore promising branches first. Experiments on over 40 datasets show that SDP is much faster and uses less memory than exhaustive methods.
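The pruning idea in the abstract can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: the helper names (`partition`, `redundancy_count`, `redundancy_upper_bound`) are hypothetical, rows are plain tuples, and none of the paper's three optimizations (attribute ordering, the Partition Cardinality Matrix, or the global scheduler) are modeled.

```python
# Sketch of SDP's core pruning quantity: redundancy counts and a monotone
# upper bound, both computed from the partition an attribute set induces.

def partition(rows, attrs):
    """Group row indices by their values on `attrs` (the partition of the
    relation induced by an attribute set)."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(groups.values())

def redundancy_count(rows, lhs, rhs):
    """Redundancy of the FD lhs -> rhs: within each lhs-group, every repeated
    rhs value beyond the first is duplicated information the FD explains."""
    total = 0
    for group in partition(rows, lhs):
        seen = {}
        for i in group:
            v = rows[i][rhs]
            seen[v] = seen.get(v, 0) + 1
        total += sum(c - 1 for c in seen.values())
    return total

def redundancy_upper_bound(rows, lhs):
    """Upper bound on redundancy_count for any rhs: at most |group| - 1 cells
    per lhs-group can be duplicates. Adding attributes to lhs refines the
    partition into smaller groups, so the bound is monotone non-increasing
    along a search branch."""
    return sum(len(g) - 1 for g in partition(rows, lhs))
```

During a lattice search, once `redundancy_upper_bound` on a candidate left-hand side falls below the $k$-th largest redundancy count found so far, every superset of that left-hand side can be skipped, which mirrors the branch-pruning rule described in the abstract.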
Related papers
- Kernel Representation and Similarity Measure for Incomplete Data [55.62595187178638]
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space.
arXiv Detail & Related papers (2025-10-15T09:41:23Z) - Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses these limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types. A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications.
arXiv Detail & Related papers (2025-05-20T06:44:34Z) - LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search [14.354312183838642]
In the query phase, a common strategy is to probe partitions based on the distance ranks of a query to partition centroids. In the partition construction phase, all partition-based methods face the boundary problem that separates a query's nearest neighbors into multiple partitions. We propose LIRA, a LearnIng-based queRy-aware pArtition framework.
arXiv Detail & Related papers (2025-03-30T12:03:57Z) - Optimal Rates for $O(1)$-Smooth DP-SCO with a Single Epoch and Large Batches [12.184984662899868]
We revisit the differentially private stochastic convex optimization (DP-SCO) problem.
We propose an algorithm that achieves the optimal rate for DP-SCO (up to polylog factors) in a single epoch.
arXiv Detail & Related papers (2024-06-04T18:59:42Z) - Efficiently Computing Similarities to Private Datasets [19.99000806126529]
Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data.
We study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x, y)$ for any query $y$.
arXiv Detail & Related papers (2024-03-13T19:19:19Z) - AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z) - User-Level Differential Privacy With Few Examples Per User [73.81862394073308]
We consider the example-scarce regime, where each user has only a few examples, and obtain the following results.
For approximate-DP, we give a generic transformation of any item-level DP algorithm to a user-level DP algorithm.
We present a simple technique for adapting the exponential mechanism [McSherry, Talwar FOCS 2007] to the user-level setting.
arXiv Detail & Related papers (2023-09-21T21:51:55Z) - Scaling up Stochastic Gradient Descent for Non-convex Optimisation [5.908471365011942]
We propose a novel approach to the problem of shared parallel computation.
By combining the two strategies into a unified framework, DPSGD achieves a better computation trade-off. The potential gains of DPSGD are demonstrated on a machine learning problem (latent Dirichlet allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C).
arXiv Detail & Related papers (2022-10-06T13:06:08Z) - Learning Aggregation Functions [78.47770735205134]
We introduce LAF (Learning Aggregation Functions), a learnable aggregator for sets of arbitrary cardinality.
We report experiments on semi-synthetic and real data showing that LAF outperforms state-of-the-art sum- (max-) decomposition architectures.
arXiv Detail & Related papers (2020-12-15T18:28:53Z) - Berrut Approximated Coded Computing: Straggler Resistance Beyond Polynomial Computing [34.69732430310801]
We propose Berrut Approximated Coded Computing (BACC) as an alternative approach to deal with stragglers effect.
BACC is proven to be numerically stable with low computational complexity.
In particular, BACC is used to train a deep neural network on a cluster of servers.
arXiv Detail & Related papers (2020-09-17T14:23:38Z) - Exploiting Database Management Systems and Treewidth for Counting [22.315022989618594]
A canonical counting problem is #SAT, which asks to count the assignments of a Boolean formula.
Recent work shows that benchmarking instances for #SAT often have reasonably small treewidth.
We introduce a general framework to solve counting questions based on state-of-the-art database management systems.
arXiv Detail & Related papers (2020-01-13T12:45:22Z) - From Open Set to Closed Set: Supervised Spatial Divide-and-Conquer for Object Counting [84.23313278891568]
We introduce the Spatial Divide-and-Conquer Network (SS-DCNet), which transforms open-set counting into a closed-set problem.
SS-DCNet can only learn from a closed set but generalize well to open-set scenarios via S-DC.
We provide theoretical analyses as well as a controlled experiment on toy data, demonstrating why closed-set modeling makes sense.
arXiv Detail & Related papers (2020-01-07T04:36:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.