How Do Community Smells Influence Self-Admitted Technical Debt in Machine Learning Projects?
- URL: http://arxiv.org/abs/2506.15884v2
- Date: Thu, 31 Jul 2025 20:18:47 GMT
- Title: How Do Community Smells Influence Self-Admitted Technical Debt in Machine Learning Projects?
- Authors: Shamse Tasnim Cynthia, Nuri Almarimi, Banani Roy
- Abstract summary: We investigated the prevalence of community smells and their relationship with Self-Admitted Technical Debt (SATD) in open-source machine learning (ML) projects. We found that community smells are widespread, exhibiting distinct distribution patterns across small, medium, and large projects. Certain smells, such as Radio Silence and Organizational Silos, are strongly correlated with higher SATD occurrences.
- Score: 1.971759811837406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Community smells reflect poor organizational practices that often lead to socio-technical issues and the accumulation of Self-Admitted Technical Debt (SATD). While prior studies have explored these problems in general software systems, their interplay in machine learning (ML)-based projects remains largely underexamined. In this study, we investigated the prevalence of community smells and their relationship with SATD in open-source ML projects, analyzing data at the release level. First, we examined the prevalence of ten community smell types across the releases of 155 ML-based systems and found that community smells are widespread, exhibiting distinct distribution patterns across small, medium, and large projects. Second, we detected SATD at the release level and applied statistical analysis to examine its correlation with community smells. Our results showed that certain smells, such as Radio Silence and Organizational Silos, are strongly correlated with higher SATD occurrences. Third, we considered the six identified types of SATD to determine which community smells are most associated with each debt category. Our analysis revealed authority- and communication-related smells often co-occur with persistent code and design debt. Finally, we analyzed how the community smells and SATD evolve over the releases, uncovering project size-dependent trends and shared trajectories. Our findings emphasize the importance of early detection and mitigation of socio-technical issues to maintain the long-term quality and sustainability of ML-based systems.
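The abstract mentions a statistical analysis of the correlation between community smells and SATD at the release level but does not name the test used. As a minimal sketch, assuming a Spearman rank correlation and entirely hypothetical per-release values (the names `smell_present` and `satd_comments` are illustrative and not taken from the paper), such a check might look like this:

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one entry per release of a single project.
smell_present = [0, 0, 1, 1, 1, 0]   # 1 if a given smell was detected
satd_comments = [3, 5, 9, 12, 8, 4]  # SATD comments found in that release

rho = spearman(smell_present, satd_comments)
print(f"Spearman rho = {rho:.3f}")
```

A positive rho on real data would only indicate association, not that the smell causes the debt; a significance test and a correction for multiple smell types would also be needed before drawing conclusions.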
Related papers
- How Do Code Smells Affect Skill Growth in Scratch Novice Programmers? [3.8506666685467343]
The study will deliver the first large-scale, fine-grained map linking specific CT competencies to concrete design flaws and antipatterns. By clarifying how programming habits influence early skill acquisition, the work advances both computing-education theory and practical tooling for sustainable software maintenance and evolution.
arXiv Detail & Related papers (2025-07-23T08:30:06Z) - Socio-Technical Smell Dynamics in Code Samples: A Multivocal Review on Emergence, Evolution, and Co-Occurrence [0.0]
Code samples play a pivotal role in open-source ecosystems (OSSECOs). This study investigates how code and community smells emerge, co-occur, and evolve within code samples maintained in OSSECOs.
arXiv Detail & Related papers (2025-07-17T18:46:08Z) - Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery [52.616615506638205]
Novel class discovery (NCD) aims to cluster novel classes by leveraging knowledge from disjoint known classes. We propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD). IICMVNCD is the first attempt to explore NCD in a multi-view setting.
arXiv Detail & Related papers (2025-07-16T08:42:52Z) - How Do Communities of ML-Enabled Systems Smell? A Cross-Sectional Study on the Prevalence of Community Smells [13.840177755312665]
We conducted an empirical study on 188 repositories from the NICHE dataset using the CADOCS tool to identify and analyze community smells. We found that certain smells, such as Prima Donna Effects and Sharing Villainy, are more prevalent and fluctuate over time compared to others like Radio Silence or Organizational Skirmish.
arXiv Detail & Related papers (2025-04-24T10:23:37Z) - Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey [107.08019135783444]
Out-of-distribution (OOD) samples are crucial for ensuring the safety of machine learning systems. Several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD).
arXiv Detail & Related papers (2024-07-31T17:59:58Z) - A Comprehensive Survey of Contamination Detection Methods in Large Language Models [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges. LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes real capability improvement in the field of NLP, yet there remains a lack of methods on how to efficiently detect contamination.
arXiv Detail & Related papers (2024-03-31T14:32:02Z) - VEC-SBM: Optimal Community Detection with Vectorial Edges Covariates [67.51637355249986]
We study an extension of the Stochastic Block Model (SBM), a widely used statistical framework for community detection.
We propose a novel algorithm based on iterative refinement techniques and show that it optimally recovers the latent communities.
We rigorously assess the added value of leveraging edge's side information in the community detection process.
arXiv Detail & Related papers (2024-02-29T02:19:55Z) - Individual context-free online community health indicators fail to identify open source software sustainability [3.192308005611312]
We monitored thirty-eight open source projects over the period of a year.
None of the projects were abandoned during this period, and only one project entered a planned shutdown.
Results were highly heterogeneous, showing little commonality across documentation, mean response times for issues and code contributions, and available funding/staffing resources.
arXiv Detail & Related papers (2023-09-21T14:41:41Z) - Continuous Integration and Software Quality: A Causal Explanatory Study [0.46040036610482665]
Continuous Integration (CI) is a software engineering practice that aims to reduce the cost and risk of code integration among teams.
Recent empirical studies have confirmed associations between CI and software quality (SQ).
arXiv Detail & Related papers (2023-09-18T23:10:34Z) - Locating Community Smells in Software Development Processes Using Higher-Order Network Centralities [38.72139150402261]
Community smells are negative patterns in software development teams' interactions that impede their ability to create software.
Current approaches aim to detect community smells by analysing static network representations of software teams' interaction structures.
We show that higher-order network models provide a robust means of revealing such hidden patterns and complex relationships.
arXiv Detail & Related papers (2023-09-14T06:48:15Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z) - Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs) including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level.
We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z) - A game-theoretic analysis of networked system control for common-pool resource management using multi-agent reinforcement learning [54.55119659523629]
Multi-agent reinforcement learning has recently shown great promise as an approach to networked system control.
Common-pool resources include arable land, fresh water, wetlands, wildlife, fish stock, forests and the atmosphere.
arXiv Detail & Related papers (2020-10-15T14:12:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.