Towards Causal Analysis of Empirical Software Engineering Data: The
Impact of Programming Languages on Coding Competitions
- URL: http://arxiv.org/abs/2301.07524v6
- Date: Fri, 1 Sep 2023 12:42:09 GMT
- Authors: Carlo A. Furia, Richard Torkar, Robert Feldt
- Abstract summary: This paper discusses some novel techniques based on structural causal models.
We apply these ideas to analyzing public data about programmer performance in Code Jam.
We find considerable differences between a purely associational and a causal analysis of the very same data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is abundant observational data in the software engineering domain,
whereas running large-scale controlled experiments is often practically
impossible. Thus, most empirical studies can only report statistical
correlations -- instead of potentially more insightful and robust causal
relations. To support analyzing purely observational data for causal relations,
and to assess any differences between purely predictive and causal models of
the same data, this paper discusses some novel techniques based on structural
causal models (such as directed acyclic graphs of causal Bayesian networks).
Using these techniques, one can rigorously express, and partially validate,
causal hypotheses; and then use the causal information to guide the
construction of a statistical model that captures genuine causal relations --
such that correlation does imply causation. We apply these ideas to analyzing
public data about programmer performance in Code Jam, a large world-wide coding
contest organized by Google every year. Specifically, we look at the impact of
different programming languages on a participant's performance in the contest.
While the overall effect associated with programming languages is weak compared
to other variables -- regardless of whether we consider correlational or causal
links -- we found considerable differences between a purely associational and a
causal analysis of the very same data. The takeaway message is that even an
imperfect causal analysis of observational data can help answer the salient
research questions more precisely and more robustly than with just purely
predictive techniques -- where genuine causal effects may be confounded.
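The abstract's central move, using a causal graph to decide which variables must be adjusted for so that a statistical contrast reflects a genuine causal effect rather than confounding, can be illustrated with a small simulation. The sketch below is purely illustrative: the variables (an experience confounder, a binary language choice, a score) and the stratified backdoor adjustment are invented for this example, not taken from the paper's Code Jam data or its actual models.

```python
import random

random.seed(0)

# Toy structural model: experience E is a confounder that influences both
# language choice L and contest score Y. The true causal effect of L on Y
# is zero, so any naive association between L and Y is pure confounding.
n = 20000
rows = []
for _ in range(n):
    e = random.random()                  # experience, uniform on [0, 1)
    l = 1 if random.random() < e else 0  # experienced coders prefer L = 1
    y = 10 * e + random.gauss(0, 1)      # score depends on experience only
    rows.append((e, l, y))

def mean(xs):
    return sum(xs) / len(xs)

# Purely associational contrast: E[Y | L=1] - E[Y | L=0].
naive = (mean([y for e, l, y in rows if l == 1])
         - mean([y for e, l, y in rows if l == 0]))

# Backdoor adjustment implied by the DAG E -> L, E -> Y: stratify on the
# confounder E, then average within-stratum contrasts by stratum size.
adjusted, total = 0.0, 0
for k in range(10):
    lo, hi = k / 10, (k + 1) / 10
    s = [(e, l, y) for e, l, y in rows if lo <= e < hi]
    y1 = [y for e, l, y in s if l == 1]
    y0 = [y for e, l, y in s if l == 0]
    if y1 and y0:
        adjusted += len(s) * (mean(y1) - mean(y0))
        total += len(s)
adjusted /= total

print(f"naive contrast:    {naive:.2f}")    # large, driven by confounding
print(f"adjusted contrast: {adjusted:.2f}") # near the true effect of zero
```

Here the naive contrast is strongly positive even though language choice has no causal effect at all, while the DAG-guided adjustment recovers an estimate near zero, mirroring the paper's point that associational and causal analyses of the same data can diverge considerably.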
Related papers
- Counterfactual Causal Inference in Natural Language with Large Language Models [9.153187514369849]
We propose an end-to-end causal structure discovery and causal inference method from natural language.
We first use an LLM to extract the instantiated causal variables from text data and build a causal graph.
We then conduct counterfactual inference on the estimated graph.
arXiv Detail & Related papers (2024-10-08T21:53:07Z)
- CAnDOIT: Causal Discovery with Observational and Interventional Data from Time-Series [4.008958683836471]
CAnDOIT is a causal discovery method to reconstruct causal models using both observational and interventional data.
The use of interventional data in the causal analysis is crucial for real-world applications, such as robotics.
A Python implementation of CAnDOIT has also been developed and is publicly available on GitHub.
arXiv Detail & Related papers (2024-10-03T13:57:08Z)
- CausalLP: Learning causal relations with weighted knowledge graph link prediction [5.3454230926797734]
CausalLP formulates the issue of incomplete causal networks as a knowledge graph completion problem.
The use of knowledge graphs to represent causal relations enables the integration of external domain knowledge.
Two primary tasks are supported by CausalLP: causal explanation and causal prediction.
arXiv Detail & Related papers (2024-04-23T20:50:06Z)
- Sample, estimate, aggregate: A recipe for causal discovery foundation models [28.116832159265964]
We train a supervised model that learns to predict a larger causal graph from the outputs of classical causal discovery algorithms run over subsets of variables.
Our approach is enabled by the observation that typical errors in the outputs of classical methods remain comparable across datasets.
Experiments on real and synthetic data demonstrate that this model maintains high accuracy in the face of misspecification or distribution shift.
arXiv Detail & Related papers (2024-02-02T21:57:58Z)
- Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
- DOMINO: Visual Causal Reasoning with Time-Dependent Phenomena [59.291745595756346]
We propose a set of visual analytics methods that allow humans to participate in the discovery of causal relations associated with windows of time delay.
Specifically, we leverage a well-established method, logic-based causality, to enable analysts to test the significance of potential causes.
Since an effect can be a cause of other effects, we allow users to aggregate different temporal cause-effect relations found with our method into a visual flow diagram.
arXiv Detail & Related papers (2023-03-12T03:40:21Z)
- Measuring Causal Effects of Data Statistics on Language Model's 'Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language for describing how training data influences predictions, through a causal framework.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z)
- Causal Regularization Using Domain Priors [23.31291916031858]
We propose a causal regularization method that can incorporate causal domain priors into the network.
We show that this approach can generalize to various kinds of specifications of causal priors.
On most datasets, domain-prior consistent models can be obtained without compromising on accuracy.
arXiv Detail & Related papers (2021-11-24T13:38:24Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data [63.15776078733762]
We propose Amortized Causal Discovery, a novel framework to learn to infer causal relations from time-series data.
We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance.
arXiv Detail & Related papers (2020-06-18T19:59:12Z)
- On Disentangled Representations Learned From Correlated Data [59.41587388303554]
We bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data.
We show that systematically induced correlations in the dataset are being learned and reflected in the latent representations.
We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
arXiv Detail & Related papers (2020-06-14T12:47:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.