OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework
- URL: http://arxiv.org/abs/2406.04598v1
- Date: Fri, 7 Jun 2024 03:09:22 GMT
- Title: OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework
- Authors: Wei Zhou, Hong Huang, Guowen Zhang, Ruize Shi, Kehan Yin, Yuanyuan Lin, Bang Liu
- Abstract summary: Causal discovery offers a promising approach to improve transparency and reliability.
We propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects.
We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms.
- Score: 21.87740178652843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have excelled in various natural language processing tasks, but challenges in interpretability and trustworthiness persist, limiting their use in high-stakes fields. Causal discovery offers a promising approach to improve transparency and reliability. However, current evaluations are often one-sided and lack assessments focused on interpretability performance. Additionally, these evaluations rely on synthetic data and lack comprehensive assessments of real-world datasets. As a result, promising methods may be overlooked. To address these issues, we propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects, which are crucial attributes that help improve the interpretability of LLMs. We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms. Additionally, our new metrics account for undirected edges, enabling fair comparisons between Directed Acyclic Graphs (DAGs) and Completed Partially Directed Acyclic Graphs (CPDAGs). Experimental results show significant shortcomings in existing algorithms' generalization capabilities on real data, highlighting the potential for performance improvement and the importance of our framework in advancing causal discovery techniques.
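The abstract does not spell out the metrics themselves, but the sketch below illustrates the general idea of scoring DAGs and CPDAGs on a common footing: a structural-Hamming-style distance that gives partial credit when one graph leaves an edge undirected. The adjacency-matrix convention, the function name edge_aware_shd, and the 0.5 partial-credit weight are illustrative assumptions for this sketch, not the paper's actual definitions.

```python
import numpy as np

def edge_aware_shd(true_adj, pred_adj):
    """SHD-style distance that handles undirected edges (illustrative sketch).

    Hypothetical adjacency convention:
      adj[i, j] == 1 and adj[j, i] == 0 -> directed edge i -> j
      adj[i, j] == 1 and adj[j, i] == 1 -> undirected edge i - j
      adj[i, j] == 0 and adj[j, i] == 0 -> no edge between i and j
    """
    d = true_adj.shape[0]
    cost = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            t = (true_adj[i, j], true_adj[j, i])
            p = (pred_adj[i, j], pred_adj[j, i])
            if t == p:
                continue  # skeleton and orientation both agree
            if any(t) != any(p):
                cost += 1.0   # missing or extra edge
            elif t == (1, 1) or p == (1, 1):
                cost += 0.5   # undirected vs. directed: partial credit (assumed weight)
            else:
                cost += 1.0   # edge present in both but reversed
    return cost

# Example: ground-truth DAG 0 -> 1 -> 2 vs. a CPDAG that leaves 0 - 1 undirected.
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
cpdag = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 0, 0]])
print(edge_aware_shd(truth, cpdag))  # 0.5
```

With a conventional SHD, the undirected 0 - 1 edge would typically count as a full error against a DAG; a partial-credit scheme like the one sketched here is one way to make DAG and CPDAG outputs comparable, which is the kind of fairness the abstract describes.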
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into the specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z)
- A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs).
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
arXiv Detail & Related papers (2024-05-27T09:08:08Z)
- Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- Towards Robust Aspect-based Sentiment Analysis through Non-counterfactual Augmentations [40.71705332298682]
We present an alternative approach that relies on non-counterfactual data augmentation.
Our approach further establishes a new state-of-the-art on the ABSA robustness benchmark and transfers well across domains.
arXiv Detail & Related papers (2023-06-24T13:57:32Z)
- On Certifying and Improving Generalization to Unseen Domains [87.00662852876177]
Domain Generalization (DG) aims to learn models whose performance remains high on unseen domains encountered at test time.
It is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets.
We propose a universal certification framework that can efficiently certify the worst-case performance of any DG method.
arXiv Detail & Related papers (2022-06-24T16:29:43Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- Towards Comparability in Non-Intrusive Load Monitoring: On Data and Performance Evaluation [1.0312968200748116]
Non-Intrusive Load Monitoring (NILM) comprises a set of techniques that provide insights into the energy consumption of households and industrial facilities.
Despite progress made on disaggregation techniques, performance evaluation and comparability remain open research questions.
Detailed information on pre-processing as well as data cleaning methods, the importance of unified performance reporting, and the need for complexity measures in load disaggregation are found to be the most urgent issues in NILM-related research.
arXiv Detail & Related papers (2020-01-20T10:13:51Z)