Related papers: Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

URL: http://arxiv.org/abs/2508.20737v1
Date: Thu, 28 Aug 2025 13:00:28 GMT
Title: Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol
Authors: Wei Ma, Yixiao Yang, Qiang Hu, Shi Ying, Zhi Jin, Bo Du, Zhenchang Xing, Tianlin Li, Junjie Shi, Yang Liu, Linxiao Jiang
Abstract summary: Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
Score: 83.83217247686402
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Applications of Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core. We then assess the applicability of traditional software testing methods in each layer: directly applicable at the shell layer, requiring semantic reinterpretation at the orchestration layer, and necessitating paradigm shifts at the inference core. A comparative analysis of Testing AI methods from the software engineering community and safety analysis techniques from the AI community reveals structural disconnects in testing unit abstraction, evaluation metrics, and lifecycle management. We identify four fundamental differences that underlie six core challenges. To address these, we propose four types of collaborative strategies (Retain, Translate, Integrate, and Runtime) and explore a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. Based on these strategies, we offer practical guidance and a protocol proposal to support the standardization and tooling of LLM application testing. We propose a protocol, Agent Interaction Communication Language (AICL), for communication between AI agents. AICL has test-oriented features and is easily integrated into current agent frameworks.

Related papers

Large Language Model Agent for User-friendly Chemical Process Simulations [0.0]
A large language model (LLM) agent is integrated with AVEVA Process Model Protocol (MCP), allowing natural language simulations. Two case studies assess the framework across different task complexities and interaction modes. The framework benefits both educational purposes, by translating technical concepts and demonstrating, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework suggests LLM-based agents can become valuable collaborators.
arXiv Detail & Related papers (2026-01-15T12:18:45Z)
Policy-Conditioned Policies for Multi-Agent Task Solving [53.67744322553693]
In this work, we propose a paradigm shift that bridges the gap by representing policies as human-interpretable source code. We reformulate the learning problem by utilizing Large Language Models (LLMs) as approximate interpreters. We formalize this process as Programmatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients.
arXiv Detail & Related papers (2025-12-24T07:42:10Z)
The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops [0.6345523830122167]
Meta-Prompt Protocol formalizes the orchestration of Large Language Models as a programmable, self-optimizing system. Treating natural language instructions as differentiable variables within a semantic graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse.
arXiv Detail & Related papers (2025-12-17T03:32:21Z)
SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations. An Experiment Manager orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
BarrierBench : Evaluating Large Language Models for Safety Verification in Dynamical Systems [4.530582224312311]
We introduce an LLM-based agentic framework for barrier certificate synthesis. The framework uses natural language reasoning to propose, refine, and validate candidate certificates. BarrierBench is a benchmark of 100 dynamical systems spanning linear, nonlinear, discrete-time, and continuous-time settings.
arXiv Detail & Related papers (2025-11-12T14:23:49Z)
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Models-powered software engineering. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks [14.408364047538578]
Large language models (LLMs) have been widely adopted across diverse domains of software engineering. This work presents CoRe, a benchmark designed to evaluate LLMs on fundamental static analysis tasks.
arXiv Detail & Related papers (2025-07-03T01:35:58Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems [10.67359331022116]
Talk Structurally, Act Hierarchically (TalkHier) is a novel framework that introduces a structured communication protocol for context-rich exchanges. TalkHier surpasses various types of SoTA, including inference scaling model (OpenAI-o1), open-source multi-agent models (e.g., AgentVerse)
arXiv Detail & Related papers (2025-02-16T12:26:58Z)
Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
Advancing Code Coverage: Incorporating Program Analysis with Large Language Models [8.31978033489419]
We propose TELPA, a novel technique to generate tests that can reach hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques.
arXiv Detail & Related papers (2024-04-07T14:08:28Z)
Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.