AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
- URL: http://arxiv.org/abs/2603.03233v1
- Date: Tue, 03 Mar 2026 18:25:00 GMT
- Title: AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
- Authors: Zihang Zeng, Jiaquan Zhang, Pengze Li, Yuan Qi, Xi Chen
- Abstract summary: Large Language Models (LLMs) demonstrate potential for automating scientific code generation but face challenges in reliability, error propagation, and evaluation. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback.
- Score: 4.782965804438204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate potential for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
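The abstract does not come with an implementation, but the loop it describes is concrete enough to sketch. The Python below is a minimal, hypothetical rendering of the adversarial co-optimization: all three agents are stubbed, and the Dirichlet posterior over prompt templates as well as the 0.5/0.3/0.2 weighting of the three quality metrics are assumptions of this sketch, not details from the paper.
```python
"""Minimal, hypothetical sketch of the Bayesian adversarial loop described
in the abstract. Agent internals are stubbed; the Dirichlet posterior over
prompt templates and the 0.5/0.3/0.2 metric weights are assumptions."""
import random

PROMPT_TEMPLATES = [
    "Write well-documented scientific Python for: {task}",
    "Write vectorized, validated numerical code for: {task}",
    "Write modular, unit-testable code for: {task}",
]
alpha = [1.0] * len(PROMPT_TEMPLATES)  # Dirichlet pseudo-counts (uniform prior)

def task_manager(task: str, round_: int) -> list:
    """Stub: structure the user task into progressively harder test cases."""
    return [f"test_{round_}_{i}" for i in range(3 + round_)]

def code_generator(prompt: str) -> str:
    """Stub: the LLM call that produces a candidate solution."""
    return f"# candidate solution for: {prompt}"

def evaluator(code: str, tests: list) -> float:
    """Stub: blend the three metrics named in the abstract (weights assumed)."""
    functional = random.random()   # fraction of adaptive tests passed
    structural = random.random()   # alignment with the structured plan
    static = random.random()       # static-analysis score
    return 0.5 * functional + 0.3 * structural + 0.2 * static

def sample_template() -> int:
    """Draw a template index from the posterior mean over templates."""
    total = sum(alpha)
    return random.choices(range(len(alpha)),
                          weights=[a / total for a in alpha])[0]

def run(task: str, rounds: int = 5) -> str:
    best_code, best_score = "", -1.0
    for r in range(rounds):
        tests = task_manager(task, r)   # adversary: refine the test cases
        idx = sample_template()
        code = code_generator(PROMPT_TEMPLATES[idx].format(task=task))
        score = evaluator(code, tests)
        alpha[idx] += score             # Bayesian pseudo-count update
        if score > best_score:
            best_code, best_score = code, score
    return best_code

print(run("interpolate sparse ocean-temperature grids"))
```
The property the sketch tries to preserve is the co-optimization the abstract describes: the test suite hardens each round while probability mass shifts toward prompt templates whose code scores well.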
Related papers
- Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation [13.994379905835716]
DebateCoder is a multi-agent collaborative framework designed to improve the reasoning ability of Small Language Models (SLMs). It uses a structured role-playing protocol with three agents: User Agent (A_UA), Technical Agent (A_TA), and Quality Assurance Agent (A_QA). It also includes an Adaptive Confidence Gating mechanism with a 95% threshold to balance accuracy and inference efficiency (a sketch of this gating idea follows this entry).
arXiv Detail & Related papers (2026-01-29T09:48:15Z)
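A minimal sketch of the gating idea summarized above, assuming a self-reported confidence score and stubbed fast/debate paths; the function names and calling convention are hypothetical, not DebateCoder's API.
```python
# Hypothetical sketch: accept a cheap single-pass answer when confidence
# clears the 95% threshold, otherwise escalate to the multi-agent debate.
from typing import Callable, Tuple

def gated_generate(task: str,
                   fast_path: Callable[[str], Tuple[str, float]],
                   debate_path: Callable[[str], str],
                   threshold: float = 0.95) -> str:
    answer, confidence = fast_path(task)  # cheap single-model attempt
    if confidence >= threshold:
        return answer                     # skip the expensive debate
    return debate_path(task)              # full A_UA / A_TA / A_QA round

# Example with stubbed paths:
fast = lambda t: (f"def solve(): ...  # draft for {t}", 0.90)
debate = lambda t: f"def solve(): ...  # refined by debate for {t}"
print(gated_generate("parse FASTA records", fast, debate))
```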
- ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks [62.031889234230725]
6G networks rely on complex cross-layer optimization, yet manually translating high-level intents into mathematical formulations remains a bottleneck. We present ComAgent, a multi-LLM agentic AI framework.
arXiv Detail & Related papers (2026-01-27T13:43:59Z)
- SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials (a minimal sketch of this pipeline follows this entry).
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
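A minimal sketch of how a User Agent feeding standardized configurations into a fault-tolerant Experiment Manager might be wired, per the summary above; the config fields, retry policy, and knowledge-base shape are all assumptions of this sketch.
```python
# Illustrative only: "objective -> standardized config -> managed trials".
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    objective: str
    model: str = "resnet18"   # assumed default fields
    lr: float = 1e-3
    epochs: int = 10

@dataclass
class KnowledgeBase:
    records: list = field(default_factory=list)
    def log(self, cfg: ExperimentConfig, metric: float) -> None:
        self.records.append((cfg, metric))  # feedback for later trials

def user_agent(objective: str) -> ExperimentConfig:
    """Stub: an LLM would translate the research objective here."""
    return ExperimentConfig(objective=objective)

def experiment_manager(cfg: ExperimentConfig, kb: KnowledgeBase,
                       retries: int = 2) -> float:
    for _ in range(retries + 1):
        try:
            metric = 0.9   # stand-in for an actual training run
            kb.log(cfg, metric)
            return metric
        except RuntimeError:
            continue       # fault tolerance: retry the failed trial
    raise RuntimeError("all attempts failed")

kb = KnowledgeBase()
experiment_manager(user_agent("classify cell images"), kb)
```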
- Toward Automated and Trustworthy Scientific Analysis and Visualization with LLM-Generated Code [6.068120728706316]
Large language models (LLMs) offer a promising solution by generating code from natural language descriptions. We construct a benchmark suite of domain-inspired prompts that reflect real-world research tasks. Our findings show that, without human intervention, the reliability of LLM-generated code is limited.
arXiv Detail & Related papers (2025-11-26T21:27:03Z)
- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs. We analyze the code capability of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: a System Shell Layer, a Prompt Orchestration Layer, and an LLM Inference Core (a sketch of this decomposition follows this entry).
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
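The three-layer decomposition above can be made concrete with a short sketch: if the non-deterministic LLM Inference Core is isolated behind an interface, the System Shell and Prompt Orchestration layers become testable against a deterministic fake. The layer names come from the summary; everything else is hypothetical.
```python
from typing import Callable

def inference_core(prompt: str) -> str:
    """The real, non-deterministic LLM call would live here."""
    raise NotImplementedError

def prompt_orchestration(query: str, core: Callable[[str], str]) -> str:
    # Builds the prompt; depends on the core only through its interface.
    return core(f"You are a helpful assistant.\nUser: {query}")

def system_shell(query: str, core: Callable[[str], str]) -> str:
    # Pre/post-processing wrapped around the orchestration layer.
    return prompt_orchestration(query.strip(), core).upper()

# A deterministic fake core makes the two outer layers unit-testable.
fake_core = lambda prompt: "ok:" + prompt.splitlines()[-1]
assert system_shell("  ping ", fake_core) == "OK:USER: PING"
```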
- Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision [4.55391222496256]
Large language models (LLMs) represent an active and promising direction in generative artificial intelligence. In this work, we construct a novel agent framework for solving representative problems in scientific computing. The proposed agent, incorporating a "rewriting-resolution-review-revision" logical chain, is integrated in a collaborative and interactive manner (a sketch of this chain follows this entry).
arXiv Detail & Related papers (2025-08-28T12:50:48Z)
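A minimal sketch of the "rewriting-resolution-review-revision" chain, with each stage stubbed where an LLM agent would sit; the stage names come from the summary, while the control flow and stopping rule are assumptions of this sketch.
```python
def rewrite(problem: str) -> str:
    return f"spec: {problem}"              # restate into a precise spec

def resolve(spec: str) -> str:
    return f"# candidate code for {spec}"  # generate a candidate solution

def review(code: str) -> list:
    # Collect reviewer feedback; empty list means the code is accepted.
    return [] if "spec:" in code else ["does not follow the spec"]

def revise(code: str, issues: list) -> str:
    return code + "\n# revised: " + "; ".join(issues)

def re4(problem: str, max_iters: int = 3) -> str:
    code = resolve(rewrite(problem))
    for _ in range(max_iters):
        issues = review(code)
        if not issues:
            break                          # accepted: stop iterating
        code = revise(code, issues)
    return code

print(re4("solve the 1D heat equation"))
```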
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code. Current large language models (LLMs) still struggle with secure coding. A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z)
- Benchmarking LLM-based Agents for Single-cell Omics Analysis [6.915378212190715]
AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion. We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis.
arXiv Detail & Related papers (2025-08-16T04:26:18Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code (a sketch of the reward idea follows this entry).
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
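A hedged sketch of the kind of hybrid reward such a framework might optimize: a functional signal from unit tests blended with a quality signal from static analysis. The 0.7/0.3 mix and the crude smell counter are assumptions of this sketch; the paper's actual reward design may differ.
```python
def static_penalty(code: str) -> float:
    # Crude stand-in for a real analyzer: count a couple of known smells.
    smells = code.count("eval(") + code.count("except:")
    return min(1.0, 0.25 * smells)

def reward(code: str, tests_passed: int, tests_total: int) -> float:
    functional = tests_passed / max(1, tests_total)  # unit-test pass rate
    quality = 1.0 - static_penalty(code)             # static-analysis score
    return 0.7 * functional + 0.3 * quality          # weights assumed

print(reward("def f(x):\n    return x + 1\n", 9, 10))  # -> 0.93
```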
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation [20.013757490442064]
We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions. CodeIF encompasses a broad range of tasks, including function synthesis, algorithmic instructions, and code explanation. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
arXiv Detail & Related papers (2025-02-26T14:19:49Z)