Related papers: Relational Deep Dive: Error-Aware Queries Over Unstructured Data

URL: http://arxiv.org/abs/2511.02711v1
Date: Tue, 04 Nov 2025 16:30:55 GMT
Title: Relational Deep Dive: Error-Aware Queries Over Unstructured Data
Authors: Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas
Abstract summary: ReDD (Relational Deep Dive) is a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. The main contribution is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs.
Score: 9.0236658372663
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
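The abstract does not spell out how SCAPE calibrates its error detector. As a rough illustration of what "error detection with coverage guarantees" can mean, the following is a minimal sketch of a generic split-conformal-style threshold, not the paper's method. It assumes a held-out calibration set of detector scores for extractions known to be wrong, and that new errors are exchangeable with them; the function names `calibrate_threshold` and `flag_for_review` and the sample scores are hypothetical.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a score threshold from calibration errors (hypothetical helper).

    error_scores: detector scores for calibration extractions known to be
    wrong (higher score = more likely an error). Under exchangeability, a new
    true error scores at or above the returned threshold with probability
    at least 1 - alpha.
    """
    n = len(error_scores)
    # Rank of the order statistic that may be sacrificed: floor(alpha * (n + 1)).
    k = math.floor(alpha * (n + 1))
    if k == 0:
        # Too few calibration errors for this alpha: flag everything.
        return float("-inf")
    # k-th smallest calibration score (1-indexed).
    return sorted(error_scores)[k - 1]


def flag_for_review(scores, threshold):
    """Mark each extraction whose score reaches the threshold for correction."""
    return [s >= threshold for s in scores]


# Illustrative, made-up calibration scores of seven known-bad extractions.
calibration = [0.9, 0.8, 0.95, 0.4, 0.7, 0.85, 0.6]
threshold = calibrate_threshold(calibration, alpha=0.25)
flags = flag_for_review([0.3, 0.6, 0.9], threshold)
```

Lowering `alpha` moves the threshold down, so more extractions are routed to correction. That is the accuracy versus human-cost trade-off the abstract attributes to SCAPE-HYB, shown here only in its simplest form.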

Related papers

Relatron: Automating Relational Machine Learning over Relational Databases [50.94254514286021]
We present a study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) accuracy is an unreliable guide for architecture choice.
arXiv Detail & Related papers (2026-02-26T02:45:22Z)
From Few-Shot to Zero-Shot: Towards Generalist Graph Anomaly Detection [89.52759572485276]
ARC is a few-shot generalist GAD method that leverages in-context learning and requires only a few labeled normal samples at inference time. ARC and ARC_zero effectively detect anomalies, exhibit strong generalization ability, and perform efficiently under few-shot and zero-shot settings.
arXiv Detail & Related papers (2026-02-21T10:59:00Z)
RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning [0.0]
RIFT (Reinforcement Learning-guided Intelligent Fault Targeting) is a scalable framework that automates the discovery of minimal, high-impact fault scenarios. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem.
arXiv Detail & Related papers (2025-12-10T17:07:19Z)
Stress-Testing Causal Claims via Cardinality Repairs [11.043119484281531]
How robust is a causal claim to small, targeted modifications in the data? We introduce SubCure, a framework for auditing via cardinality repairs. We develop efficient algorithms that incorporate machine unlearning techniques to update causal estimates without retraining from scratch.
arXiv Detail & Related papers (2025-12-02T07:31:03Z)
RFOD: Random Forest-based Outlier Detection for Tabular Data [12.469208664014472]
Outlier detection is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare. RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem. RFOD consistently outperforms state-of-the-art baselines in detection accuracy.
arXiv Detail & Related papers (2025-10-09T19:02:12Z)
Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction [1.41282143488996]
Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models.
arXiv Detail & Related papers (2025-09-15T21:47:52Z)
Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes [7.036974567001374]
ReFine is a framework that guides generation toward domain-specific feature distribution. Experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-09-12T04:34:46Z)
Distributionally Robust Optimization with Adversarial Data Contamination [49.89480853499918]
We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
arXiv Detail & Related papers (2025-07-14T18:34:10Z)
Stress-Testing ML Pipelines with Adversarial Data Corruption [11.91482648083998]
Regulators now demand evidence that high-stakes systems can withstand realistic, interdependent errors. We introduce SAVAGE, a framework that formally models data-quality issues through dependency graphs and flexible corruption templates. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity.
arXiv Detail & Related papers (2025-06-02T00:41:24Z)
Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control [52.405085773954596]
Retrieval-Augmented Generation has emerged as a powerful approach to mitigate large language model hallucinations. Existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies such as over-retrieving. We introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off.
arXiv Detail & Related papers (2025-02-17T18:56:20Z)
Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process. RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary. Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy. After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples. We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.