Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
- URL: http://arxiv.org/abs/2510.26423v1
- Date: Thu, 30 Oct 2025 12:20:25 GMT
- Title: Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
- Authors: Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo, Qianru Zhang, See-Kiong Ng,
- Abstract summary: Test oracle generation in non-regression testing is a longstanding challenge in software engineering.<n>We introduce Nexus, a novel multi-agent framework to address this challenge.
- Score: 57.40527331817245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated selfrefinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-theart baselines. For instance, Nexus improves the test-level oracle accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT4.1-Mini generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
Related papers
- MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility"<n>We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions.<n>Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z) - Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs.<n> kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback.<n>Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - Can We Predict Before Executing Machine Learning Agents? [74.39460101251792]
We formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons.<n>We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report.<n>We instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%.
arXiv Detail & Related papers (2026-01-09T16:44:17Z) - Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation [2.794277194464204]
Unit testing plays a critical role in ensuring software correctness.<n>Traditional methods rely on search-based or randomized algorithms to achieve high code coverage.<n>We propose CANDOR, a novel prompt engineering-based LLM framework for automated unit test generation in Java.
arXiv Detail & Related papers (2025-06-03T14:43:05Z) - Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering [0.0]
We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score.<n>Our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from 65% to 2%.<n>Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models.
arXiv Detail & Related papers (2025-05-16T18:19:38Z) - Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs)<n>We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs.<n>We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z) - AugmenTest: Enhancing Tests with LLM-Driven Oracles [2.159639193866661]
AugmenTest is an approach leveraging Large Language Models to infer correct test oracles based on available documentation of the software under test.<n>AugmenTest includes four variants: Simple Prompt, Extended Prompt, RAG with a generic prompt (without the context of class or method under test), and RAG with Simple Prompt, each offering different levels of contextual information to the LLMs.<n>Results show that in the most conservative scenario, AugmenTest's Extended Prompt consistently outperformed the Simple Prompt, achieving a success rate of 30% for generating correct assertions.
arXiv Detail & Related papers (2025-01-29T07:45:41Z) - TOGLL: Correct and Strong Test Oracle Generation with LLMs [0.8057006406834466]
Test oracles play a crucial role in software testing, enabling effective bug detection.<n>Despite initial promise, neural-based methods for automated test oracle generation often result in a large number of false positives.<n>We present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles.
arXiv Detail & Related papers (2024-05-06T18:37:35Z) - Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z) - Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures.<n>We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.