Accountability in Code Review: The Role of Intrinsic Drivers and the Impact of LLMs
- URL: http://arxiv.org/abs/2502.15963v1
- Date: Fri, 21 Feb 2025 21:52:29 GMT
- Title: Accountability in Code Review: The Role of Intrinsic Drivers and the Impact of LLMs
- Authors: Adam Alami, Victor Vadmand Jensen, Neil A. Ernst
- Abstract summary: Key intrinsic drivers of accountability for code quality are personal standards, professional integrity, pride in code quality, and maintaining one's reputation. The introduction of AI into software engineering must preserve social integrity and collective accountability mechanisms.
- Score: 6.841710924733614
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accountability is an innate part of social systems. It maintains stability and ensures positive pressure on individuals' decision-making. As actors in a social system, software developers are accountable to their team and organization for their decisions. However, the drivers of accountability and how it changes behavior in software development are less understood. In this study, we look at how the social aspects of code review affect software engineers' sense of accountability for code quality. Since software engineering (SE) increasingly involves Large Language Model (LLM) assistance, we also evaluate the impact on accountability when introducing LLM-assisted code reviews. We carried out a two-phased sequential qualitative study (interviews -> focus groups). In Phase I (16 interviews), we investigated the intrinsic drivers influencing software engineers' sense of accountability for code quality, relying on self-reported claims. In Phase II, we tested these traits in a more natural setting by simulating traditional peer-led reviews with focus groups and then LLM-assisted review sessions. We found four key intrinsic drivers of accountability for code quality: personal standards, professional integrity, pride in code quality, and maintaining one's reputation. In traditional peer-led reviews, we observed a transition from individual to collective accountability once code reviews are initiated. We also found that the introduction of LLM-assisted reviews disrupts this accountability process, challenging the reciprocity of accountability that takes place in peer-led evaluations, i.e., one cannot be accountable to an LLM. Our findings imply that the introduction of AI into SE must preserve social integrity and collective accountability mechanisms.
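The Phase II LLM-assisted review sessions are described here only at the protocol level. Purely as an illustration of what driving such a session might look like (a minimal sketch, not the authors' actual setup), the snippet below asks a chat model to review a diff; it assumes the openai Python client (1.x), and the model name, prompt wording, and review_diff helper are all placeholders.

```python
# Illustrative sketch only: a minimal LLM-assisted code review call.
# Assumes the `openai` Python client (>= 1.0); the model name and
# prompt wording are placeholders, not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_diff(diff: str) -> str:
    """Ask the model for review comments on a unified diff."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a code reviewer. Point out defects, "
                        "style issues, and maintainability risks."},
            {"role": "user", "content": f"Please review this diff:\n{diff}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample_diff = """\
--- a/util.py
+++ b/util.py
@@ -1 +1 @@
-def mean(xs): return sum(xs) / len(xs)
+def mean(xs): return sum(xs) / max(len(xs), 1)
"""
    print(review_diff(sample_diff))
```

Note that, as the paper argues, such a session lacks the reciprocity of a peer-led review: the model produces comments, but the author cannot be held accountable to it.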
Related papers
- Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation [56.84819098277464]
CoNL is a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
arXiv Detail & Related papers (2026-01-29T09:41:14Z) - Engagement in Code Review: Emotional, Behavioral, and Cognitive Dimensions in Peer vs. LLM Interactions [4.311425473934521]
We study how software engineers engage in Large Language Model (LLM)-assisted code reviews compared to human peer-led reviews. We identify the self-regulation strategies engineers use to manage their emotions in response to negative feedback. We show that LLM-assisted review redirects engagement from emotion management to cognitive load management.
arXiv Detail & Related papers (2025-12-04T23:09:24Z) - Code Review as Decision-Making -- Building a Cognitive Model from the Questions Asked During Code Review [2.8299846354183953]
We build a cognitive model of code review bottom-up through thematic, statistical, temporal, and sequential analysis of the transcribed material. The model shows how developers move through two phases during the code review: first an orientation phase to establish context and rationale, then an analytical phase to understand, assess, and plan the rest of the review.
arXiv Detail & Related papers (2025-07-13T14:04:16Z) - Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games [87.5673042805229]
How large language models balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. We adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation.
arXiv Detail & Related papers (2025-06-29T15:02:47Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents [48.925168866726814]
AgentAuditor is a universal, training-free, memory-augmented reasoning framework. ASSEBench is the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats.
arXiv Detail & Related papers (2025-05-31T17:10:23Z) - Are We on the Same Page? Examining Developer Perception Alignment in Open Source Code Reviews [2.66269503676104]
Code reviews are a critical aspect of open-source software (OSS) development, ensuring quality and fostering collaboration.
This study examines perceptions, challenges, and biases in OSS code review processes, focusing on the perspectives of contributors and maintainers.
arXiv Detail & Related papers (2025-04-25T15:03:39Z) - Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios.
Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z) - Media and responsible AI governance: a game-theoretic and LLM analysis [61.132523071109354]
This paper investigates the interplay between AI developers, regulators, users, and the media in fostering trustworthy AI systems.
Using evolutionary game theory and large language models (LLMs), we model the strategic interactions among these actors under different regulatory regimes.
arXiv Detail & Related papers (2025-03-12T21:39:38Z) - ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z) - Decoding AI Judgment: How LLMs Assess News Credibility and Bias [33.7054351451505]
Large Language Models (LLMs) are increasingly embedded in settings that involve evaluative processes. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings (NewsGuard and Media Bias/Fact Check, MBFC) and against human judgments collected through a controlled experiment.
arXiv Detail & Related papers (2025-02-06T18:52:10Z) - PSSD: Making Large Language Models Self-denial via Human Psyche Structure [5.057375783924452]
We present PSSD, which refers to and implements the human psyche structure such that three distinct and interconnected roles contribute to human reasoning. Extensive experiments demonstrate that the proposed design not only better enhances reasoning capabilities, but also integrates seamlessly with current models.
arXiv Detail & Related papers (2025-02-03T13:37:21Z) - Human and Machine: How Software Engineers Perceive and Engage with AI-Assisted Code Reviews Compared to Their Peers [4.734450431444635]
We investigate how software engineers perceive and engage with Large Language Model (LLM)-assisted code reviews. We found that engagement in code review is multi-dimensional, spanning cognitive, emotional, and behavioral dimensions. Our findings contribute to a deeper understanding of how AI tools are impacting SE socio-technical processes.
arXiv Detail & Related papers (2025-01-03T20:42:51Z) - Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
Large Language Models (LLMs) can serve as effective evaluators for code summarization methods. CODERPE prompts LLM agents to play diverse roles, such as code reviewer, code author, code editor, and system analyst. CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27% (a minimal version of this rank-correlation check is sketched after this list).
arXiv Detail & Related papers (2024-12-02T09:56:18Z) - Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? [14.970843824847956]
We run a controlled experiment with 29 experts who reviewed different programs with and without the support of an automatically generated code review. We show that reviewers consider most of the issues automatically identified by the LLM to be valid, and that the availability of an automated review as a starting point strongly influences their behavior. Reviewers who started from an automated review identified more low-severity issues, but no more high-severity issues, than those following a completely manual process.
arXiv Detail & Related papers (2024-11-18T09:24:01Z) - INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance.
The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic.
We observed that our approach can provide advanced critiques covering both safety and helpfulness, significantly improving the quality of the generated code.
arXiv Detail & Related papers (2024-06-23T15:55:07Z) - Understanding the Building Blocks of Accountability in Software Engineering [3.521765725717803]
We investigate the factors that foster software engineers' individual accountability within their teams.
Our findings identify two primary forms of accountability shaping software engineers' individual sense of accountability: institutionalized and grassroots.
arXiv Detail & Related papers (2024-02-02T21:53:35Z) - How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z) - CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z) - The Mind Is a Powerful Place: How Showing Code Comprehensibility Metrics Influences Code Understanding [10.644832702859484]
We investigate whether a displayed metric value for source code comprehensibility anchors developers in their subjective rating of source code comprehensibility.
We found that the displayed value of a comprehensibility metric has a significant and large anchoring effect on a developer's code comprehensibility rating.
arXiv Detail & Related papers (2020-12-16T14:27:45Z)
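For the code summarization evaluator entry above (CODERPE), the reported agreement with human judgments is a Spearman rank correlation. A minimal, self-contained sketch of that kind of check, using invented placeholder scores rather than the paper's data:

```python
# Minimal sketch: rank correlation between LLM-assigned and human
# scores for a set of code summaries. All scores below are invented
# placeholders, not data from the CODERPE paper.
from scipy.stats import spearmanr

llm_scores   = [4.5, 3.0, 2.0, 4.0, 1.5, 3.5]  # evaluator model's ratings
human_scores = [5.0, 3.0, 1.0, 4.0, 2.0, 3.0]  # human ratings, same items

rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Spearman's rho compares rankings rather than raw values, so it rewards an evaluator that orders summaries the way humans do even when its absolute scoring scale differs.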