Related papers: Wink: Recovering from Misbehaviors in Coding Agents

Wink: Recovering from Misbehaviors in Coding Agents

URL: http://arxiv.org/abs/2602.17037v2
Date: Fri, 20 Feb 2026 18:13:08 GMT
Title: Wink: Recovering from Misbehaviors in Coding Agents
Authors: Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, Satish Chandra,
Abstract summary: Autonomous coding agents are increasingly being adopted in the software industry to automate complex engineering tasks.<n>These agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly.<n>We present a system for automatically recovering from agentic misbehaviors at scale.
Score: 6.794419834325995
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.

Related papers

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning [34.06688334066569]
AgentDropoutV2 is a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.<n>Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented to iteratively correct errors.<n> Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance.
arXiv Detail & Related papers (2026-02-26T17:31:43Z)
The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering.<n>We propose a novel framework for textbfgeneral agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome.<n>We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z)
Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization [69.36509281190662]
Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck.<n>We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design.<n>We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions.
arXiv Detail & Related papers (2025-12-02T18:42:26Z)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks.<n>We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks.<n>Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering [51.07491603393163]
tAgent is a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals.<n>By leveraging soft supervision and weighted aggregation of agent outputs, Agent learns principled collaboration schemes that capture the complementary strengths of diverse agents.
arXiv Detail & Related papers (2025-10-06T23:20:49Z)
Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark [23.342903884925576]
This paper presents a study of root cause identification of platform-orchestrated agentic systems.<n>We construct a dataset AgentFail containing 307 failure logs from ten agentic systems, each with fine-grained annotations linking failures to their root causes.<n>We develop a taxonomy that characterizes failure root causes and analyze their distribution across different platforms and task domains.
arXiv Detail & Related papers (2025-09-28T08:30:03Z)
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks [8.218266805768687]
We present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents.<n>We evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%.<n>We develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation.
arXiv Detail & Related papers (2025-08-18T17:55:22Z)
TRAIL: Trace Reasoning and Agentic Issue Localization [5.025960714013197]
This work articulates the need for robust and dynamic evaluation methods for agentic workflow traces.<n>We present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks.<n>To ensure ecological validity, we curate traces from both single and multi-agent systems.
arXiv Detail & Related papers (2025-05-13T14:55:31Z)
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.<n>We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.<n>The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.