Related papers: Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

URL: http://arxiv.org/abs/2602.08765v1
Date: Mon, 09 Feb 2026 15:06:24 GMT
Title: Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas
Authors: Micah Villmow,
Abstract summary: This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools.<n>The key metric is Cost-of-Pass (CoP), which directly quantifies the trade-off between complexity and efficiency.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based tools are automating more software development tasks at a rapid pace, but there is no rigorous way to evaluate how different architectural choices -- prompts, skills, tools, multi-agent setups -- materially affect both capability and cost. This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools through structured ablation studies that uses seven testing tiers (T0-T6) progressively adding complexity to isolate what directly influences results and how. The key metric is Cost-of-Pass (CoP): the expected dollar cost to get one correct solution, which directly quantifies the trade-off between complexity and efficiency. The framework is model-agnostic, designed to work with any CLI tool; this paper demonstrates it with Claude Sonnet 4.5, using multiple LLM judges (Opus 4.5, Sonnet 4.5, Haiku 4.5) from the same vendor for evaluation consensus, where judges score results using direct tests, human-designed LLM-evaluated rubrics, and qualitative assessment. The result is a reproducible framework that quantifies trade-offs between agent complexity and actual outcomes, suggesting that architectural complexity does not always improve quality.

Related papers

Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production [0.0]
Large language models (LLMs) have demonstrated strong capabilities in open-ended reasoning and generative language tasks.<n>For structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone.<n>We show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance.
arXiv Detail & Related papers (2026-02-06T03:54:28Z)
MARS: Modular Agent with Reflective Search for Automated AI Research [48.54202614558741]
We introduce MARS, a framework optimized for autonomous AI research.<n>MARS relies on three pillars: (1) Budget-Aware Planning via cost-Aware Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights.
arXiv Detail & Related papers (2026-02-02T19:00:03Z)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
LM4Opt-RA: A Multi-Candidate LLM Framework with Structured Ranking for Automating Network Resource Allocation [0.7933039558471408]
We tackle complex analytical and mathematical reasoning tasks requiring nuanced contextual understanding.<n>Existing benchmarks datasets cannot address the complexities of such problems with dynamic environments, variables, and heterogeneous constraints.<n>We introduce NL4RA, a curated dataset comprising 50 resource allocation optimization problems formulated as LP, ILP, and MILP.<n>We then evaluate the performance of well-known open-source LLMs with varying parameter counts.
arXiv Detail & Related papers (2025-11-13T23:19:43Z)
AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition [27.312190686305588]
Large language models (LLMs) have shown strong potential in automated scoring.<n>Their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment.<n>We propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition.
arXiv Detail & Related papers (2025-09-26T05:45:14Z)
Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation) is an automated framework for benchmarking large language models (LLMs)<n> REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality.<n>It quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering.
arXiv Detail & Related papers (2025-08-04T18:52:01Z)
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs [24.403284945948272]
AutoJudger is an agent-driven framework for efficient and adaptive benchmarking of multimodal large language models.<n>AutoJudger employs the Item Response Theory (IRT) to estimate the question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions.
arXiv Detail & Related papers (2025-05-27T16:17:15Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
textbfSCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models.<n>SCAN incorporates four key components:.<n>TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy;.<n>RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag;.<n>A PC$2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach achieves significantly higher accuracy compared to classic LLM-as-a-Judge method
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks [0.11458853556386796]
This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs)<n>We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions.<n>Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance, OpenAI's GPT-4.1, GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in
arXiv Detail & Related papers (2025-03-23T16:20:14Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.