DMCD: Semantic-Statistical Framework for Causal Discovery
- URL: http://arxiv.org/abs/2602.20333v1
- Date: Mon, 23 Feb 2026 20:29:35 GMT
- Title: DMCD: Semantic-Statistical Framework for Causal Discovery
- Authors: Samarth KaPatel, Sofia Nikiforova, Giacinto Paolo Saggese, Paul Smith
- Abstract summary: We present DMCD, a causal discovery framework that integrates semantic drafting from variable metadata with statistical validation on observational data. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis.
- Score: 0.03499870393443267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.
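The two-phase idea in the abstract (a draft DAG, then an audit by conditional independence testing) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the draft edge list is hard-coded where an LLM proposal would go, the conditional independence test is a Fisher z test on partial correlations (which assumes roughly linear-Gaussian data), and the names `fisher_z_pvalue` and `audit_draft` are hypothetical.

```python
import math
import numpy as np


def fisher_z_pvalue(data, i, j, cond):
    """Two-sided p-value for the partial correlation of columns i and j given cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # inverse correlation matrix
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log below finite
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def audit_draft(data, names, draft_edges, alpha=0.01):
    """Compare a draft DAG with the data (assumed audit rule, not the paper's).

    Returns (unsupported, missing):
      unsupported: draft edges a -> b where a and b look independent
                   given the other draft parents of b.
      missing:     non-adjacent pairs that still look dependent given
                   the draft parents of both nodes.
    """
    col = {name: k for k, name in enumerate(names)}
    parents = {name: set() for name in names}
    for a, b in draft_edges:
        parents[b].add(a)
    adjacent = {frozenset(edge) for edge in draft_edges}

    unsupported, missing = [], []
    for a, b in draft_edges:
        cond = [col[p] for p in sorted(parents[b] - {a})]
        if fisher_z_pvalue(data, col[a], col[b], cond) > alpha:
            unsupported.append((a, b))
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            a, b = names[x], names[y]
            if frozenset((a, b)) in adjacent:
                continue
            cond = [col[p] for p in sorted((parents[a] | parents[b]) - {a, b})]
            if fisher_z_pvalue(data, col[a], col[b], cond) < alpha:
                missing.append((a, b))
    return unsupported, missing


# Synthetic data from the chain A -> B -> C (illustrative only).
rng = np.random.default_rng(0)
n_samples = 2000
A = rng.normal(size=n_samples)
B = 0.8 * A + rng.normal(size=n_samples)
C = 0.8 * B + rng.normal(size=n_samples)
data = np.column_stack([A, B, C])
names = ["A", "B", "C"]

# Stand-in for a Phase I draft: it contains an extra edge A -> C.
draft = [("A", "B"), ("B", "C"), ("A", "C")]
unsupported, missing = audit_draft(data, names, draft)
print("edges to reconsider:", unsupported)
print("pairs to reconsider:", missing)
```

In this sketch the flagged edges and pairs are only candidates for revision; how DMCD actually turns detected discrepancies into edge revisions is not specified in the abstract.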
Related papers
- Reasoning-Driven Multimodal LLM for Domain Generalization [72.00754603114187]
We study the role of reasoning in domain generalization using the DomainBed-Reasoning dataset. We propose RD-MLDG, a framework with two components: MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization). Experiments on standard DomainBed datasets demonstrate that RD-MLDG achieves complementary state-of-the-art performances.
arXiv Detail & Related papers (2026-02-27T08:10:06Z) - STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z) - Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection [0.0]
Two-Stage LKPLO is a novel multi-stage outlier detection framework. It overcomes the coexisting limitations of conventional projection-based methods. It achieves state-of-the-art performance on challenging datasets.
arXiv Detail & Related papers (2025-10-28T03:53:46Z) - Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models [4.201987249923826]
This work experiments with a hybrid approach for detecting relationships using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-04T12:11:05Z) - Financial Data Analysis with Robust Federated Logistic Regression [7.68275287892947]
In this study, we focus on the analysis of financial data in a federated setting, wherein data is distributed across multiple clients or locations. We propose a robust federated logistic regression-based framework that strives to strike a balance between these goals.
arXiv Detail & Related papers (2025-04-28T20:42:24Z) - Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z) - Multi-Agent Causal Discovery Using Large Language Models [10.020595983728482]
Causal discovery is a critical research area in machine learning. We introduce the Multi-Agent Causal Discovery Framework (MAC). It consists of two key modules: the Debate-Coding Module (DCM) and the Meta-Debate Module (MDM).
arXiv Detail & Related papers (2024-07-21T06:21:47Z) - DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z) - SSL Framework for Causal Inconsistency between Structures and Representations [31.895570222735955]
Cross-pollination between causal discovery and deep learning has led to increasingly extensive interactions. Indefinite Data has conflicts between causal relationships expressed by the causal structure and causal representation generated by deep learning models. To alleviate causal inconsistency, we proposed a self-supervised learning framework based on intervention.
arXiv Detail & Related papers (2023-10-28T08:29:49Z) - Multi-level Consistency Learning for Semi-supervised Domain Adaptation [85.90600060675632]
Semi-supervised domain adaptation (SSDA) aims to apply knowledge learned from a fully labeled source domain to a scarcely labeled target domain.
We propose a Multi-level Consistency Learning framework for SSDA.
arXiv Detail & Related papers (2022-05-09T06:41:18Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client
Noise via Local Mixup [58.894901088797376]
federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - Uncovering Main Causalities for Long-tailed Information Extraction [14.39860866665021]
Long-tailed distributions caused by the selection bias of a dataset may lead to incorrect correlations.
This motivates us to propose counterfactual IE (CFIE), a novel framework that aims to uncover the main causalities behind data.
arXiv Detail & Related papers (2021-09-11T08:08:24Z)