Learning From Software Failures: A Case Study at a National Space Research Center
- URL: http://arxiv.org/abs/2509.06301v2
- Date: Mon, 15 Sep 2025 15:30:20 GMT
- Title: Learning From Software Failures: A Case Study at a National Space Research Center
- Authors: Dharun Anandayuvaraj, Tanmay Singla, Zain Hammadeh, Andreas Lund, Alexandra Holloway, James C. Davis,
- Abstract summary: We conduct a case study through 10 in-depth interviews with research software engineers at a national space research center.<n>We examine how they learn from failures: how they gather, document, share, and apply lessons.<n>Our findings provide insight into how engineers learn from failures in practice.
- Score: 38.518223399280835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software failures can have significant consequences, making learning from failures a critical aspect of software engineering. While software organizations are recommended to conduct postmortems, the effectiveness and adoption of these practices vary widely. Understanding how engineers gather, document, share, and apply lessons from failures is essential for improving reliability and preventing recurrence. High-reliability organizations (HROs) often develop software systems where failures carry catastrophic risks, requiring continuous learning to ensure reliability. These organizations provide a valuable setting to examine practices and challenges for learning from software failures. Such insight could help develop processes and tools to improve reliability and prevent recurrence. However, we lack in-depth industry perspectives on the practices and challenges of learning from failures. To address this gap, we conducted a case study through 10 in-depth interviews with research software engineers at a national space research center. We examine how they learn from failures: how they gather, document, share, and apply lessons. To assess transferability, we include data from 5 additional interviews at other HROs. Our findings provide insight into how engineers learn from failures in practice. To summarize: (1) failure learning is informal, ad hoc, and inconsistently integrated into SDLC; (2) recurring failures persist due to absence of structured processes; and (3) key challenges, including time constraints, knowledge loss from turnover and fragmented documentation, and weak process enforcement, undermine systematic learning. Our findings deepen understanding of how software engineers learn from failures and offer guidance for improving failure management practices.
Related papers
- "So Am I Dr. Frankenstein? Or Were You a Monster the Whole Time?": Mitigating Software Project Failure With Loss-Aversion-Aware Development Methodologies [0.3626013617212666]
We conduct a study of the experiences of 600 software engineers in the UK and USA on project success experiences.<n> Empirical evaluation finds that approaches like ensuring clear requirements before the start of development, when loss aversion is at its lowest, correlated to 97% higher project success.
arXiv Detail & Related papers (2024-10-28T02:57:17Z) - Learning From Lessons Learned: Preliminary Findings From a Study of
Learning From Failure [3.045851438458641]
Organizations analyze and learn from system failures.
Co-evolve both the technical and human parts of their systems based on what they learn.
Despite established processes and tool support, it is not straightforward to take what was learned from a failure and successfully improve the reliability of the socio-technical system.
arXiv Detail & Related papers (2024-02-14T19:29:04Z) - Applying Machine Learning Analysis for Software Quality Test [0.0]
It is critical to comprehend what triggers maintenance and if it may be predicted.
Numerous methods of assessing the complexity of created programs may produce useful prediction models.
In this paper, the machine learning is applied on the available data to calculate the cumulative software failure levels.
arXiv Detail & Related papers (2023-05-16T06:10:54Z) - Security for Machine Learning-based Software Systems: a survey of
threats, practices and challenges [0.76146285961466]
How to securely develop the machine learning-based modern software systems (MLBSS) remains a big challenge.
latent vulnerabilities and privacy issues exposed to external users and attackers will be largely neglected and hard to be identified.
We consider that security for machine learning-based software systems may arise from inherent system defects or external adversarial attacks.
arXiv Detail & Related papers (2022-01-12T23:20:25Z) - Technology Readiness Levels for Machine Learning Systems [107.56979560568232]
Development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
We have developed a proven systems engineering approach for machine learning development and deployment.
Our "Machine Learning Technology Readiness Levels" framework defines a principled process to ensure robust, reliable, and responsible systems.
arXiv Detail & Related papers (2021-01-11T15:54:48Z) - Dos and Don'ts of Machine Learning in Computer Security [74.1816306998445]
Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance.
We identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.
We propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.
arXiv Detail & Related papers (2020-10-19T13:09:31Z) - Technology Readiness Levels for AI & ML [79.22051549519989]
Development of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
Engineering systems follow well-defined processes and testing standards to streamline development for high-quality, reliable results.
We propose a proven systems engineering approach for machine learning development and deployment.
arXiv Detail & Related papers (2020-06-21T17:14:34Z) - Towards CRISP-ML(Q): A Machine Learning Process Model with Quality
Assurance Methodology [53.063411515511056]
We propose a process model for the development of machine learning applications.
The first phase combines business and data understanding as data availability oftentimes affects the feasibility of the project.
The sixth phase covers state-of-the-art approaches for monitoring and maintenance of a machine learning applications.
arXiv Detail & Related papers (2020-03-11T08:25:49Z) - Curriculum Learning for Reinforcement Learning Domains: A Framework and
Survey [53.73359052511171]
Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback.
We present a framework for curriculum learning (CL) in RL, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals.
arXiv Detail & Related papers (2020-03-10T20:41:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.