Mitigating Omitted Variable Bias in Empirical Software Engineering
- URL: http://arxiv.org/abs/2501.17026v2
- Date: Fri, 19 Sep 2025 12:47:23 GMT
- Title: Mitigating Omitted Variable Bias in Empirical Software Engineering
- Authors: Carlo A. Furia, Richard Torkar
- Abstract summary: Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. Omitted variable bias presents a significant threat to the validity of empirical research. This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering.
- Score: 2.9506547907696006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Omitted variable bias occurs when a statistical model leaves out variables that are relevant determinants of the effects under study. This results in the model attributing the missing variables' effect to some of the included variables -- hence over- or under-estimating the latter's true effect. Omitted variable bias presents a significant threat to the validity of empirical research, particularly in non-experimental studies such as those prevalent in empirical software engineering. This paper illustrates the impact of omitted variable bias on two case studies in the software engineering domain, and uses them to present methods to investigate the possible presence of omitted variable bias, to estimate its impact, and to mitigate its drawbacks. The analysis techniques we present are based on causal structural models of the variables of interest, which provide a practical, intuitive summary of the key relations among variables. This paper demonstrates a sequence of analysis steps that inform the design and execution of any empirical study in software engineering. An important observation is that it pays off to invest effort investigating omitted variable bias before actually executing an empirical study, because this effort can lead to a more solid study design, and to a significant reduction in its threats to validity.
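The mechanism described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under assumed conditions, not the paper's own analysis: it assumes a linear data-generating process in which a hypothetical variable `z` influences both a predictor `x` and an outcome `y`. All names, coefficients, and the sample size are illustrative choices. A regression of `y` on `x` alone attributes part of the effect of `z` to `x`, whereas a regression that also includes `z` recovers a slope close to the assumed true effect.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, seed=0, beta_x=1.0, beta_z=2.0, gamma=0.8):
    """Assumed linear model: z -> x and z -> y, with true effect beta_x of x on y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [gamma * zi + rng.gauss(0, 1) for zi in z]
    y = [beta_x * xi + beta_z * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, z, y


def slope_omitting_z(x, y):
    """Least-squares slope of y on x alone (z is left out of the model)."""
    return cov(x, y) / cov(x, x)


def slope_adjusting_for_z(x, z, y):
    """Least-squares coefficient of x when z is also a regressor.

    Solves the 2x2 normal equations on centered data with Cramer's rule.
    """
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    return (sxy * szz - szy * sxz) / det


x, z, y = simulate()
naive = slope_omitting_z(x, y)
adjusted = slope_adjusting_for_z(x, z, y)
# Textbook bias term for the linear case: beta_z * cov(x, z) / var(x),
# using the assumed true beta_z = 2.0.
predicted_bias = 2.0 * cov(x, z) / cov(x, x)

print(f"slope omitting z:      {naive:.3f}")
print(f"slope adjusting for z: {adjusted:.3f}")
print(f"predicted bias:        {predicted_bias:.3f}")
```

In this sketch the gap between the two slopes tracks the bias term `beta_z * cov(x, z) / var(x)`, so the overestimate grows with both the omitted variable's effect on the outcome and its association with the included predictor.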
Related papers
- Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship [54.575090553659074]
We develop an algorithm that efficiently uncovers causal relationships with quadratic complexity in the number of observational variables. Our experiments on a varied benchmark of large-scale datasets show superior or equivalent performance compared to existing works.
arXiv Detail & Related papers (2026-02-03T10:26:16Z) - Exploring the Garden of Forking Paths in Empirical Software Engineering Research: A Multiverse Analysis [3.6324565773746147]
We conduct a so-called multiverse analysis on a published empirical SE paper. We identify nine pivotal analytical decisions with at least one equally defensible alternative. The overwhelming majority produced qualitatively different, and sometimes even opposite, findings.
arXiv Detail & Related papers (2025-12-09T18:47:00Z) - Temporal Latent Variable Structural Causal Model for Causal Discovery under External Interferences [53.308122815325326]
We introduce latent variables to represent unobserved factors that affect the observed data. Specifically, to capture the causal strength and adjacency information, we propose a new temporal latent variable structural causal model. Considering that expert knowledge can provide information about unknown interferences in certain scenarios, we develop a method that facilitates the incorporation of prior knowledge into parameter learning.
arXiv Detail & Related papers (2025-11-13T07:10:10Z) - Data Fusion for Partial Identification of Causal Effects [62.56890808004615]
We propose a novel partial identification framework that enables researchers to answer key questions: Is the causal effect positive or negative? How severe must assumption violations be to overturn this conclusion? We apply our framework to the Project STAR study, which investigates the effect of classroom size on students' third-grade standardized test performance.
arXiv Detail & Related papers (2025-05-30T07:13:01Z) - Achieving Fairness in Predictive Process Analytics via Adversarial Learning [50.31323204077591]
This paper addresses the challenge of integrating a debiasing phase into predictive business process analytics.
Our framework, which leverages adversarial debiasing, is evaluated on four case studies, showing a significant reduction in the contribution of biased variables to the predicted value.
arXiv Detail & Related papers (2024-10-03T15:56:03Z) - Hypothesizing Missing Causal Variables with LLMs [55.28678224020973]
We formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph.
We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect.
We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.
arXiv Detail & Related papers (2024-09-04T10:37:44Z) - Unsupervised Pairwise Causal Discovery on Heterogeneous Data using Mutual Information Measures [49.1574468325115]
Causal Discovery is a technique that tackles the challenge by analyzing the statistical properties of the constituent variables.
We question the current (possibly misleading) baseline results on the basis that they were obtained through supervised learning.
In consequence, we approach this problem in an unsupervised way, using robust Mutual Information measures.
arXiv Detail & Related papers (2024-08-01T09:11:08Z) - Causal Inference with Latent Variables: Recent Advances and Future Prospectives [43.04559575298597]
Causal inference (CI) aims to infer intrinsic causal relations among variables of interest.
The lack of observation of important variables severely compromises the reliability of CI methods.
Various consequences can be incurred if these latent variables are carelessly handled.
arXiv Detail & Related papers (2024-06-20T03:15:53Z) - A Second Look at the Impact of Passive Voice Requirements on Domain Modeling: Bayesian Reanalysis of an Experiment [4.649794383775257]
We reanalyze the only known controlled experiment investigating the impact of passive voice on the subsequent activity of domain modeling.
Our results reveal that the effects observed by the original authors turned out to be much less significant than previously assumed.
arXiv Detail & Related papers (2024-02-16T16:24:00Z) - Identifiable Latent Polynomial Causal Models Through the Lens of Change [82.14087963690561]
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data. One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as identifiability.
arXiv Detail & Related papers (2023-10-24T07:46:10Z) - Nonlinearity, Feedback and Uniform Consistency in Causal Structural Learning [0.8158530638728501]
Causal Discovery aims to find automated search methods for learning causal structures from observational data.
This thesis focuses on two questions in causal discovery: (i) providing an alternative definition of k-Triangle Faithfulness that (i) is weaker than strong faithfulness when applied to the Gaussian family of distributions, and (ii) under the assumption that the modified version of Strong Faithfulness holds.
arXiv Detail & Related papers (2023-08-15T01:23:42Z) - A Causal Framework for Decomposing Spurious Variations [68.12191782657437]
We develop tools for decomposing spurious variations in Markovian and Semi-Markovian models.
We prove the first results that allow a non-parametric decomposition of spurious effects.
The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine.
arXiv Detail & Related papers (2023-06-08T09:40:28Z) - Identifying Weight-Variant Latent Causal Models [82.14087963690561]
We find that transitivity plays a key role in impeding the identifiability of latent causal representations.
Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling.
We propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and causal relationships among them.
arXiv Detail & Related papers (2022-08-30T11:12:59Z) - A Critical Look At The Identifiability of Causal Effects with Deep Latent Variable Models [2.326384409283334]
We use causal effect variational autoencoder (CEVAE) as a case study.
CEVAE seems to work reliably under some simple scenarios, but it does not identify the correct causal effect with a misspecified latent variable or a complex data distribution.
Our results show that the question of identifiability cannot be disregarded, and we argue that more attention should be paid to it in future work.
arXiv Detail & Related papers (2021-02-12T17:43:18Z) - Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtly spurious correlations in training data induced by non-causal variables for prediction.
We propose an algorithm based on conditional independence tests that separates causal variables using a seed variable as a prior, and we adopt these variables for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.