OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework
- URL: http://arxiv.org/abs/2406.04598v1
- Date: Fri, 7 Jun 2024 03:09:22 GMT
- Title: OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework
- Authors: Wei Zhou, Hong Huang, Guowen Zhang, Ruize Shi, Kehan Yin, Yuanyuan Lin, Bang Liu,
- Abstract summary: Causal discovery offers a promising approach to improve transparency and reliability.
We propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects.
We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms.
- Score: 21.87740178652843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have excelled in various natural language processing tasks, but challenges in interpretability and trustworthiness persist, limiting their use in high-stakes fields. Causal discovery offers a promising approach to improve transparency and reliability. However, current evaluations are often one-sided and lack assessments focused on interpretability performance. Additionally, these evaluations rely on synthetic data and lack comprehensive assessments of real-world datasets. As a result, promising methods may be overlooked. To address these issues, we propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects, which are crucial attributes that help improve the interpretability of LLMs. We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms. Additionally, our new metrics account for undirected edges, enabling fair comparisons between Directed Acyclic Graphs (DAGs) and Completed Partially Directed Acyclic Graphs (CPDAGs). Experimental results show significant shortcomings in existing algorithms' generalization capabilities on real data, highlighting the potential for performance improvement and the importance of our framework in advancing causal discovery techniques.
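The abstract does not spell out how the new metrics handle undirected edges, so the sketch below is only an illustration of the general idea, not the OCDB metric itself: a structural-distance score over adjacency matrices that charges a full error for a missing or extra adjacency but only a partial penalty when the adjacency is recovered and only the orientation differs, which is what lets CPDAG outputs be scored on the same scale as DAG outputs. The function name, the encoding, and the half-point penalty are all assumptions made for the example.

```python
import numpy as np

def shd_with_undirected(true_graph: np.ndarray, est_graph: np.ndarray) -> float:
    """Illustrative structural distance tolerant of undirected edges.

    Graphs are encoded as 0/1 adjacency matrices: entry (i, j) == 1 means
    an edge i -> j; an undirected edge i - j is encoded as 1 in both
    (i, j) and (j, i). This is NOT the OCDB metric, only a sketch of how
    a correctly adjacent but unoriented (or reversed) edge can count as a
    half error instead of a full one.
    """
    n = true_graph.shape[0]
    dist = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            t = (int(true_graph[i, j]), int(true_graph[j, i]))
            e = (int(est_graph[i, j]), int(est_graph[j, i]))
            t_adj, e_adj = max(t) == 1, max(e) == 1
            if t_adj != e_adj:
                dist += 1.0   # missing or extra adjacency: full error
            elif t_adj and t != e:
                dist += 0.5   # adjacency correct, orientation differs
    return dist

# Toy usage: true DAG 0 -> 1 -> 2 versus an estimated CPDAG with 0 - 1
# left undirected and 1 -> 2 oriented correctly.
true_g = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
est_g = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 0, 0]])
print(shd_with_undirected(true_g, est_g))  # 0.5 under this toy scoring
```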
Related papers
- Optimisation Strategies for Ensuring Fairness in Machine Learning: With and Without Demographics [4.662958544712181]
This paper introduces two formal frameworks to tackle open questions in machine learning fairness.
In one framework, operator-valued optimisation and min-max objectives are employed to address unfairness in time-series problems.
In the second framework, the challenge of lacking sensitive attributes, such as gender and race, in commonly used datasets is addressed.
arXiv Detail & Related papers (2024-11-13T22:29:23Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z) - A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs).
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
arXiv Detail & Related papers (2024-05-27T09:08:08Z) - Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z) - Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z) - Towards Robust Aspect-based Sentiment Analysis through Non-counterfactual Augmentations [40.71705332298682]
We present an alternative approach that relies on non-counterfactual data augmentation.
Our approach further establishes a new state-of-the-art on the ABSA robustness benchmark and transfers well across domains.
arXiv Detail & Related papers (2023-06-24T13:57:32Z) - On Certifying and Improving Generalization to Unseen Domains [87.00662852876177]
Domain Generalization aims to learn models whose performance remains high on unseen domains encountered at test-time.
It is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets.
We propose a universal certification framework that can efficiently certify the worst-case performance of any DG method.
arXiv Detail & Related papers (2022-06-24T16:29:43Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z) - Towards Comparability in Non-Intrusive Load Monitoring: On Data and Performance Evaluation [1.0312968200748116]
Non-Intrusive Load Monitoring (NILM) comprises a set of techniques that provide insights into the energy consumption of households and industrial facilities.
Despite progress made concerning disaggregation techniques, performance evaluation and comparability remains an open research question.
Detailed information on pre-processing and data cleaning methods, unified performance reporting, and complexity measures for load disaggregation are identified as the most pressing needs in NILM-related research.
arXiv Detail & Related papers (2020-01-20T10:13:51Z)