Intuition to Evidence: Measuring AI's True Impact on Developer Productivity
- URL: http://arxiv.org/abs/2509.19708v1
- Date: Wed, 24 Sep 2025 02:34:11 GMT
- Title: Intuition to Evidence: Measuring AI's True Impact on Developer Productivity
- Authors: Anand Kumar, Vishal Khare, Deepak Sharma, Satyam Kumar, Vijay Saini, Anshul Yadav, Sachendra Jain, Ankit Rana, Pratham Verma, Vaibhav Meena, Avinash Edubilli,
- Abstract summary: We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities.
- Score: 30.02516976149379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities into their daily workflows. Through rigorous cohort analysis, our study demonstrates statistically significant productivity improvements, including an overall 31.8% reduction in PR review cycle time. Developer adoption was strong, with 85% satisfaction for code review features and 93% expressing a desire to continue using the platform. Adoption patterns showed systematic scaling from 4% engagement in month 1 to 83% peak usage by month 6, stabilizing at 60% active engagement. Top adopters achieved a 61% increase in code volume pushed to production, contributing to approximately 30 to 40% of code shipped to production through this tool, accounting for an overall 28% increase in code shipment volume. Unlike controlled benchmark evaluations, our longitudinal analysis provides empirical evidence from production environments, revealing both the transformative potential and practical deployment challenges of integrating AI into enterprise software development workflows.
Related papers
- CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production [52.85500933801205]
CharacterFlywheel is an iterative process for improving large language models (LLMs) in production social chat applications. We refined models across 15 generations using data from both internal and external real-user traffic. We conducted 7-day A/B tests showing consistent engagement improvements.
arXiv Detail & Related papers (2026-03-02T15:27:31Z)
- SWE-Universe: Scale Real-World Verifiable Environments to Millions [84.63665266236963]
SWE-Universe is a framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). We propose a building agent powered by an efficient custom-trained model to overcome the prevalent challenges of automatic building. We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning.
arXiv Detail & Related papers (2026-02-02T17:20:30Z)
- EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight. We introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
- A Pragmatic VLA Foundation Model [66.76609538850478]
We develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data.
arXiv Detail & Related papers (2026-01-26T17:08:04Z)
- Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering [4.812321790984494]
We conduct an analysis of token consumption patterns in an LLM-MA system within the Software Development Life Cycle (SDLC). We analyze execution traces from 30 software development tasks performed by the ChatDev framework using a GPT-5 reasoning model. Our preliminary findings show that the iterative Code Review stage accounts for the majority of token consumption, averaging 59.4% of tokens.
arXiv Detail & Related papers (2026-01-20T20:52:14Z)
- WhatsCode: Large-Scale GenAI Deployment for Developer Efficiency at WhatsApp [0.8197659035200293]
Report on the industrial deployment of WhatsCode, a domain-specific AI development system that supports WhatsApp. WhatsCode evolved from targeted privacy automation to autonomous agentic workflows integrated with end-to-end feature development and DevOps processes. The system committed 692 automated/fix changes, 711 framework adoptions, and 141 feature development assists, and maintained precision in bug triage.
arXiv Detail & Related papers (2025-12-04T23:25:06Z)
- Developer Productivity with GenAI [17.44738403505224]
We surveyed 415 software practitioners to capture their perceptions of productivity changes associated with AI-assisted development. Results reveal limited overall productivity change, highlighting the productivity paradox in which developers become faster but do not necessarily create better software or feel more fulfilled.
arXiv Detail & Related papers (2025-10-28T10:23:57Z)
- SCUBA: Salesforce Computer Use Benchmark [63.66753028386581]
SCUBA is a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings.
arXiv Detail & Related papers (2025-09-30T16:48:49Z)
- The Impact of Large Language Models (LLMs) on Code Review Process [2.8071068465772853]
This research investigates the effect of GPT on GitHub pull requests (PRs). We curated a dataset of 25,473 PRs from 9,254 GitHub projects. We identified GPT-assisted PRs using a semi-automated approach that combines keyword-based detection, regular expression filtering, and manual verification.
arXiv Detail & Related papers (2025-08-14T19:39:01Z)
- RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts [4.825037489691159]
This paper presents RocketPPA, a novel ultra-fast power, performance (delay), and area (PPA) estimator. It operates directly at the code-level abstraction using HDL code as input. It achieves significant improvements in the accuracy of PPA estimation compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-27T20:35:09Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- Experience with GitHub Copilot for Developer Productivity at Zoominfo [1.631115063641726]
We evaluate GitHub Copilot's deployment and impact on developer productivity at Zoominfo. We show an average acceptance rate of 33% for suggestions and 20% for lines of code, with high developer satisfaction scores of 72%. Our findings contribute to the growing body of knowledge about AI-assisted software development in enterprise settings.
arXiv Detail & Related papers (2025-01-23T00:17:48Z)
- How Well Can Modern LLMs Act as Agent Cores in Radiology Environments? [54.36730060680139]
RadA-BenchPlat is an evaluation platform that benchmarks the performance of large language models (LLMs) in radiology environments. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs.
arXiv Detail & Related papers (2024-12-12T18:20:16Z)
- Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [62.94719119451089]
The Lingma SWE-GPT series learns from and simulates real-world code submission activities.
Lingma SWE-GPT 72B resolves 30.20% of GitHub issues, marking a significant improvement in automatic issue resolution.
arXiv Detail & Related papers (2024-11-01T14:27:16Z)
- Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace [2.5280615594444567]
Generative AI coding tools are relatively new, and their impact on developers extends beyond traditional coding metrics.
This study aims to illuminate developers' preexisting beliefs about generative AI tools, their self perceptions, and how regular use of these tools may alter these beliefs.
Our findings reveal that the introduction and sustained use of generative AI coding tools significantly increases developers' perceptions of these tools as both useful and enjoyable.
arXiv Detail & Related papers (2024-10-24T00:07:27Z)
- SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)