Related papers: SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study

URL: http://arxiv.org/abs/2510.10010v1
Date: Sat, 11 Oct 2025 04:24:04 GMT
Title: SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study
Authors: Matheus J. T. Vargas
Abstract summary: SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three-phase protocol, formed by independent analysis, cross-critique, and arbitration, filters harmful AI-generated code suggestions. The file-driven, provider-agnostic architecture enables deployment without specialized coding expertise.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present SLEAN (Simple Lightweight Ensemble Analysis Network), a deterministic framework for coordinating multiple LLM providers through text-based prompt orchestration. Unlike complex multi-agent systems requiring specialized infrastructure, SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three-phase protocol, formed by independent analysis, cross-critique, and arbitration, filters harmful AI-generated code suggestions before production deployment, addressing how AI-assisted debugging increasingly produces modifications that introduce unnecessary complexity, break existing functionality, or address problems. Evaluating 15 software bugs, we analyzed 69 AI-generated fix propositions. SLEAN's filtering accepted 22 fixes (31.9%, 95% CI 20.9-42.9%) while rejecting 47 that would have been harmful if applied verbatim. The arbitration process reduced code change surface by 83-90% relative to raw AI outputs, enforcing minimal causal edits over scope-expanding modifications. Minimal Type 2 inputs proved more efficient than detailed Type 1 inputs, requiring 2.85 versus 3.56 propositions per accepted fix (35.1% versus 28.1% acceptance, about a 20% efficiency gain). Agreement between AI systems showed weak correlation with fix quality: high convergence (at least 80%) occurred in 4 of 15 cases and improved acceptance by only 2.4 percentage points; arbitration appeared only at exactly 10% convergence in 2 of 15 cases, although low convergence alone did not necessitate arbitration. The file-driven, provider-agnostic architecture enables deployment without specialized coding expertise, making it applicable to security auditing, code review, document verification, and other domains requiring reliable multi-provider synthesis with end-to-end auditability.
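The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the reported interval is a normal-approximation (Wald) interval, which the abstract does not state, and the function name `wald_interval` is hypothetical rather than anything from the paper.

```python
import math

def wald_interval(successes, total, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Assumption: the abstract does not name its interval method;
    the Wald interval is used here only because it is the simplest.
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width

# 22 accepted fixes out of 69 propositions (figures from the abstract).
p, lo, hi = wald_interval(22, 69)
print(f"acceptance {p:.1%}, 95% CI {lo:.1%}-{hi:.1%}")

# Propositions needed per accepted fix is the reciprocal of the acceptance rate.
type1 = 1 / 0.281  # detailed Type 1 inputs, 28.1% acceptance
type2 = 1 / 0.351  # minimal Type 2 inputs, 35.1% acceptance
gain = (type1 - type2) / type1  # relative reduction in propositions per fix
print(f"Type 1: {type1:.2f}, Type 2: {type2:.2f}, gain: {gain:.1%}")
```

Under that assumption the computed interval rounds to the reported 20.9-42.9%, and the reciprocal acceptance rates round to the reported 3.56 and 2.85 propositions per accepted fix, a relative reduction of roughly 20%.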

Related papers

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk. We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z)
SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients [1.711666249985278]
SPECA is a Specification-to-Checklist framework that turns normative requirements into checklists. We instantiate SPECA in an in-the-wild security audit contest for the Fusaka upgrade, covering 11 production clients. Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities.
arXiv Detail & Related papers (2026-02-07T12:19:00Z)
Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties. We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains. We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches [0.0]
Dynamic template selection achieves significant cost reductions without compromising response quality. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
arXiv Detail & Related papers (2025-11-17T21:00:22Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
Evaluating the Use of LLMs for Documentation to Code Traceability [3.076436880934678]
Large Language Models can establish trace links between various software documentation and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets.
arXiv Detail & Related papers (2025-06-19T16:18:53Z)
Automated Repair of Ambiguous Problem Descriptions for LLM-Based Code Generation [9.943472604121425]
Ambiguity of natural language (NL) can harm software quality. We introduce an automated repair of ambiguous NL descriptions. We implement this approach in a tool called SpecFix.
arXiv Detail & Related papers (2025-05-12T06:47:53Z)
Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection [52.716143424856185]
We propose LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Our method also outperforms the greedy search in attribution efficiency, being 1.6 times faster.
arXiv Detail & Related papers (2025-04-01T06:58:15Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs). This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
Classification or Prompting: A Case Study on Legal Requirements Traceability [4.629156733452248]
Legal requirements traceability is a key task where engineers must analyze technical requirements against target artifacts. In this paper, we investigate two automated solutions based on language models, including large ones (LLMs). The first solution, Kashif, is a classifier that leverages sentence transformers and semantic similarity. The second solution, RICE_LRT, prompts a recent generative LLM based on RICE, a prompt engineering framework.
arXiv Detail & Related papers (2025-02-07T13:33:40Z)
Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness. We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems.
arXiv Detail & Related papers (2024-07-23T20:40:37Z)
Global Context Aggregation Network for Lightweight Saliency Detection of Surface Defects [70.48554424894728]
We develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects on the encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency compared with other 17 state-of-the-art methods.
arXiv Detail & Related papers (2023-09-22T06:19:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.