Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making
- URL: http://arxiv.org/abs/2601.01522v1
- Date: Sun, 04 Jan 2026 13:19:27 GMT
- Title: Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making
- Authors: Danial Amin,
- Abstract summary: Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models.
- Score: 1.2691047660244335
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds "confidence," and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes' rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single-LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.
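The core loop the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration under made-up assumptions: the state names, likelihood values, and the two-action cost table are hypothetical and only loosely echo the paper's hiring example; the paper's actual elicitation (contrastive prompting), robust aggregation, and value-of-information machinery are reduced here to a median over per-model likelihoods and a straight expected-cost minimization.

```python
from statistics import median

# Hypothetical states and costs (USD), not the paper's exact values.
# Rejecting real talent (40000) is far costlier than an unnecessary
# interview (2500): the asymmetric-error-cost setting from the abstract.
STATES = ("strong_hire", "not_a_fit")
COST = {
    "interview": {"strong_hire": 2500.0, "not_a_fit": 2500.0},
    "reject":    {"strong_hire": 40000.0, "not_a_fit": 0.0},
}

def update_belief(prior, per_model_likelihoods):
    """One Bayes step: median-aggregate each LLM's elicited likelihood
    P(evidence | state), multiply into the prior, and renormalize."""
    agg = {s: median(lk[s] for lk in per_model_likelihoods) for s in STATES}
    unnorm = {s: prior[s] * agg[s] for s in STATES}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

def best_action(belief):
    """Choose the action minimizing expected cost under the belief.
    (The paper also scores information-gathering actions, e.g. a 150 USD
    phone screen, by value of information; omitted in this sketch.)"""
    exp_cost = {a: sum(belief[s] * c[s] for s in STATES)
                for a, c in COST.items()}
    return min(exp_cost, key=exp_cost.get)

# One round of evidence from three hypothetical models.
prior = {"strong_hire": 0.5, "not_a_fit": 0.5}
likelihoods = [
    {"strong_hire": 0.8, "not_a_fit": 0.3},
    {"strong_hire": 0.7, "not_a_fit": 0.4},
    {"strong_hire": 0.9, "not_a_fit": 0.2},
]
belief = update_belief(prior, likelihoods)  # medians 0.8 vs 0.3
action = best_action(belief)
```

With these numbers the posterior on strong_hire is 0.4/0.55 ≈ 0.73, and since a constant 2500 USD interview beats a 40000 USD expected miss whenever P(strong_hire) > 2500/40000 = 0.0625, the selected action is "interview". Note how the asymmetric costs, not a confidence threshold, drive the decision.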
Related papers
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference [10.009730627424629]
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates.
arXiv Detail & Related papers (2026-02-25T16:38:03Z)
- Evaluating LLMs in Finance Requires Explicit Bias Consideration [88.38155218924999]
Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for deployment claims. No single bias is discussed in more than 28 percent of studies. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design.
arXiv Detail & Related papers (2026-02-15T17:02:01Z)
- EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge [8.50639201265868]
We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points.
arXiv Detail & Related papers (2026-01-14T04:26:43Z)
- Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs [4.673176641454931]
Large Language Models (LLMs) exhibit systematic biases across demographic groups. We conceptualise auditing as uncertainty estimation over a target fairness metric. We introduce BAFA, the Bounded Active Fairness Auditor for query-efficient auditing of black-box LLMs.
arXiv Detail & Related papers (2026-01-06T15:22:23Z)
- HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25%, respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z)
- LLMs Can Get "Brain Rot"! [68.08198331505695]
Continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). We run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets. Results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay.
arXiv Detail & Related papers (2025-10-15T13:28:49Z)
- A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning [40.6234318894435]
Large language models split into two families: reasoning-centric LLMs and agentic LLMs. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries. We present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle.
arXiv Detail & Related papers (2025-10-13T17:08:25Z)
- Scaling Truth: The Confidence Paradox in AI Fact-Checking [0.8201655885319955]
Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence.
arXiv Detail & Related papers (2025-09-10T17:36:25Z)
- Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Activation Reasoning Potential (RAP). RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning. Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z)
- Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning [59.56171041796373]
We harvest multi-modal instructional data in a robust and efficient manner. We take interaction style as a diversity indicator and use a multi-modal rich styler to identify data instruction patterns. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods.
arXiv Detail & Related papers (2025-03-17T17:11:22Z)
- Cost-Saving LLM Cascades with Early Abstention [1.3108652488669732]
We investigate the benefits of "early abstention" in LLM cascades. We find that it reduces overall test loss by 2.2% on average across six benchmarks. These gains result from a more effective use of abstention, trading a 4.1% average increase in the overall abstention rate for a 13.0% reduction in cost and a 5.0% reduction in error rate.
arXiv Detail & Related papers (2025-02-13T08:08:39Z)
- Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z)
- Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs. We observe improvements of 28% $\rightarrow$ 44.6% on MATH, and 37.9% $\rightarrow$ 53.5% on MMLU abstract algebra.
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.34900892130929]
We explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance.
arXiv Detail & Related papers (2024-07-31T17:57:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.