APEX-SWE
- URL: http://arxiv.org/abs/2601.08806v1
- Date: Tue, 13 Jan 2026 18:44:08 GMT
- Title: APEX-SWE
- Authors: Abhi Kottamasu, Akul Datta, Aakash Barthwal, Chirag Mahapatra, Ajay Arun, Adarsh Hiremath, Brendan Foody, Bertie Vidgen,
- Abstract summary: We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%.
- Score: 4.927317067589892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
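The Pass@1 metric reported above can be sketched as the fraction of tasks whose single model attempt passes the harness checks. This is a minimal illustration only; the authors open-source the actual APEX-SWE harness, which may compute the score differently:

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """Pass@1: fraction of tasks whose single attempt passed
    the task's verification checks (illustrative helper)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Illustrative run: 2 of 8 hypothetical tasks solved on the first try.
scores = {f"task-{i}": i < 2 for i in range(8)}
print(pass_at_1(scores))  # 0.25
```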
Related papers
- AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems [52.65695508605237]
We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
arXiv Detail & Related papers (2026-01-14T11:32:07Z) - LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development [33.01897134024342]
Development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents.
arXiv Detail & Related papers (2025-11-06T05:10:04Z) - From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production [6.189323683437766]
This paper reports IBM's experience developing and piloting the Computer Using Generalist Agent (CUGA). CUGA adopts a hierarchical planner-executor architecture with strong analytical foundations. It was evaluated in a pilot within the Business-Process-Outsourcing talent acquisition domain.
arXiv Detail & Related papers (2025-10-27T20:55:00Z) - EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence [17.644658293987955]
Embodied AI agents are capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. Current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations. We propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes.
arXiv Detail & Related papers (2025-10-23T14:05:55Z) - xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems [0.402058998065435]
xOffense is an AI-driven, multi-agent penetration testing framework. It shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows that scale seamlessly with computational infrastructure.
arXiv Detail & Related papers (2025-09-16T12:45:45Z) - InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling [71.37579508777843]
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments.
arXiv Detail & Related papers (2025-08-12T05:00:00Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
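The SOP-to-graph idea described in the SOPBench entry above can be sketched as follows. All names here are illustrative assumptions, not SOPBench's actual API: an SOP becomes a directed graph whose nodes are executable functions, and an agent's function-call trace is validated against the graph's allowed orderings.

```python
from typing import Callable

class SOPGraph:
    """Toy directed graph of executable SOP steps (illustrative only)."""

    def __init__(self) -> None:
        self.funcs: dict[str, Callable] = {}
        self.edges: dict[str, set[str]] = {}  # function -> allowed successors

    def add(self, name: str, fn: Callable, successors: set[str] = frozenset()) -> None:
        self.funcs[name] = fn
        self.edges[name] = set(successors)

    def valid_trace(self, trace: list[str]) -> bool:
        """True if every consecutive pair of calls follows an allowed edge."""
        return all(b in self.edges.get(a, set()) for a, b in zip(trace, trace[1:]))

# Hypothetical customer-service SOP: identity check before account lookup,
# account lookup before refund.
g = SOPGraph()
g.add("verify_identity", lambda: "ok", successors={"lookup_account"})
g.add("lookup_account", lambda: "acct", successors={"issue_refund"})
g.add("issue_refund", lambda: "done")

print(g.valid_trace(["verify_identity", "lookup_account", "issue_refund"]))  # True
print(g.valid_trace(["issue_refund", "verify_identity"]))  # False
```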
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.