ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
- URL: http://arxiv.org/abs/2602.19594v1
- Date: Mon, 23 Feb 2026 08:37:53 GMT
- Title: ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
- Authors: Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra,
- Abstract summary: We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference tasks.<n>We curated 54 tasks from merged pull requests with measurable performance improvements.
- Score: 0.8749675983608171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and bottleneck description, whereby the agent must produce an optimization patch evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. While existing benchmarks heavily use runtime-based metrics, such approaches can be gamed to pass tests without capturing the actual intent of the code changes. Therefore, we combine both hard (execution-based) and soft (LLM-based) metrics to show that both are necessary for complete evaluation. While evaluating both closed and open-source coding agents, we find no single agent dominates across codebases. Surprisingly, agents often identify correct bottlenecks but fail to execute working solutions. We also show that agents with identical underlying models differ substantially, suggesting scaffolding is as important as the model.
Related papers
- Benchmark Test-Time Scaling of General LLM Agents [27.756239376314294]
General AgentBench is a benchmark for evaluating general LLM agents across search, coding, reasoning, and tool-use domains.<n>We study performance degradation when moving from domain-specific evaluations to this general-agent setting.<n>We find that neither scaling yields effective performance improvements in practice, due to two fundamental limitations.
arXiv Detail & Related papers (2026-02-22T01:08:02Z) - Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation [0.0]
Agent-Diff is a benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs.<n>We provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software.<n>We also evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance.
arXiv Detail & Related papers (2026-02-11T13:31:18Z) - AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent [57.10083973844841]
AgentArk is a novel framework to distill multi-agent dynamics into the weights of a single model.<n>We investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios.<n>By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents.
arXiv Detail & Related papers (2026-02-03T19:18:28Z) - The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check [54.08619694620588]
We present a comprehensive evaluation of dLLMs across two distinct agentic paradigms: Embodied Agents and Tool-Calling Agents.<n>Our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones.
arXiv Detail & Related papers (2026-01-19T11:45:39Z) - AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks.<n>We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process.<n>AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z) - SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? [22.075705411944895]
SWE-fficiency is a benchmark for evaluating repository-level performance optimization on real workloads.<n>Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories.
arXiv Detail & Related papers (2025-11-08T17:55:09Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - Distilling LLM Agent into Small Models with Retrieval and Code Tools [65.73762766854192]
Agent Distillation is a framework for transferring reasoning capability and task-solving behavior from large language models into small language models.<n>Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models.
arXiv Detail & Related papers (2025-05-23T08:20:15Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [61.38499597241457]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents.<n>Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks.<n>We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debug through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25$times$ to 2.56$times$ bugs on the repo-level benchmark, Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z) - AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents [19.439775106707344]
AgentQuest is a framework where benchmarks and metrics are modular and easily through well documented and easy-to-use APIs.
We offer two new evaluation metrics that can reliably track LLM agent progress while solving a task.
We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase.
arXiv Detail & Related papers (2024-04-09T16:01:24Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.