Related papers: Agentic Rubrics as Contextual Verifiers for SWE Agents

URL: http://arxiv.org/abs/2601.04171v1
Date: Wed, 07 Jan 2026 18:38:23 GMT
Title: Agentic Rubrics as Contextual Verifiers for SWE Agents
Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He,
Abstract summary: We show that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents. Results show that Agentic Rubrics are consistent with ground-truth tests while also flagging issues that tests do not capture.
Score: 8.469998524915818
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
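The scoring and selection step described in the abstract can be illustrated with a minimal sketch. It assumes a rubric is a weighted checklist, that a judge has already produced a pass/fail verdict per criterion for each candidate patch, and that parallel TTS amounts to picking the highest-scoring candidate. The names `RubricItem`, `rubric_score`, and `select_best_patch`, the weights, and the weighted-fraction aggregation are all hypothetical; the abstract does not specify the paper's actual scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One checklist criterion; the weight is a hypothetical importance value."""
    description: str
    weight: float = 1.0


def rubric_score(rubric, verdicts):
    """Weighted fraction of rubric items a patch satisfies, in [0, 1]."""
    if len(rubric) != len(verdicts):
        raise ValueError("need one verdict per rubric item")
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(rubric, verdicts) if ok)
    return earned / total


def select_best_patch(rubric, candidate_verdicts):
    """Return the id of the highest-scoring candidate (ties go to the first seen)."""
    if not candidate_verdicts:
        raise ValueError("no candidates to choose from")
    return max(
        candidate_verdicts,
        key=lambda pid: rubric_score(rubric, candidate_verdicts[pid]),
    )


# Illustrative rubric; in the paper's setting an agent would write these
# criteria after exploring the repository.
rubric = [
    RubricItem("Fix is applied in the function named in the issue", weight=2.0),
    RubricItem("Existing public signature is unchanged", weight=1.0),
    RubricItem("Edge case for empty input is handled", weight=1.0),
]

# Verdicts would come from a judge model reading each patch; hard-coded here.
candidates = {
    "patch_a": [True, True, False],
    "patch_b": [True, False, False],
    "patch_c": [False, True, True],
}

best = select_best_patch(rubric, candidates)
print(best, rubric_score(rubric, candidates[best]))
```

No patch is executed at any point in this sketch: selection depends only on the per-criterion verdicts, which is the property the abstract highlights as avoiding environment setup overhead.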

Related papers

AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation [39.61543921719145]
AgentSelect is a benchmark that reframes agent selection as narrative query-to-agent recommendation. It converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation.
arXiv Detail & Related papers (2026-03-04T06:17:51Z)
TRACER: Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning [4.928838343487574]
Existing uncertainty proxies focus on single-shot text generation. We introduce TRACER, a trajectory-level uncertainty metric for dual-control Tool-Agent-User interaction.
arXiv Detail & Related papers (2026-02-11T22:23:56Z)
Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework [62.66056331998838]
TeaRAG is a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. Our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps.
arXiv Detail & Related papers (2025-11-07T16:08:34Z)
Issue-Oriented Agent-Based Framework for Automated Review Comment Generation [15.04868140672973]
RevAgent is a novel agent-based issue-oriented framework for code review comments. It decomposes the task into three stages: Generation, Discrimination, and Training. It significantly outperforms state-of-the-art PLM- and LLM-based baselines.
arXiv Detail & Related papers (2025-11-01T11:44:11Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
Automated Benchmark Generation for Repository-Level Coding Tasks [7.305342793164905]
SetUpAgent is a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. We generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries.
arXiv Detail & Related papers (2025-03-10T17:42:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.