Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
- URL: http://arxiv.org/abs/2512.02719v1
- Date: Tue, 02 Dec 2025 12:51:30 GMT
- Title: Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
- Authors: Julian Ma, Jun Wang, Zafeirios Fountas
- Abstract summary: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks over text and image. We measure performance, behaviour and efficiency in multimodal cue-combination.
- Score: 6.415869990358189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
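The "optimal multimodal integration" that BayesBench probes follows the classic psychophysics result: when two independent noisy Gaussian cues estimate the same quantity, the Bayes-optimal combination weights each cue by its inverse variance, so the more reliable cue dominates. A minimal sketch of that reference computation (not the benchmark's code; function names are illustrative):

```python
def combine_cues(mu_a, var_a, mu_b, var_b):
    """Bayes-optimal fusion of two independent Gaussian cues.

    Inverse-variance weighting yields the minimum-variance unbiased
    estimate; the fused variance is lower than either cue's alone.
    """
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_b)
    w_b = 1.0 - w_a
    mu = w_a * mu_a + w_b * mu_b
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    return mu, var

# Example: a precise text cue (variance 1) and a noisier visual cue (variance 4).
# The fused estimate sits closer to the text cue: combine_cues(10.0, 1.0, 14.0, 4.0)
# gives mean 10.8 and variance 0.8.
```

A model behaving Bayes-consistently should shift its estimates toward the less noisy cue in just this proportion, which is the behavioural signature the paper's Bayesian Consistency Score is designed to detect even after raw accuracy saturates.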
Related papers
- Goal-Oriented Influence-Maximizing Data Acquisition for Learning and Optimization [28.53710231018475]
We propose an active acquisition algorithm that avoids explicit posterior inference while remaining uncertainty-aware through inverse curvature. GOIMDA selects inputs by maximizing their expected influence on a user-specified goal functional. We show theoretically that, for generalized linear models, GOIMDA approximates predictive-entropy minimization up to a correction term accounting for goal alignment and prediction bias.
arXiv Detail & Related papers (2026-02-23T07:57:11Z) - Bayesian E(3)-Equivariant Interatomic Potential with Iterative Restratification of Many-body Message Passing [11.101638985590002]
Current equivariant interatomic potentials struggle with uncertainty, limiting their reliability for active learning, calibration, and out-of-distribution detection. We address these challenges by developing Bayesian E(3)-equivariant potentials with iterative restratification of many-body message passing. Our approach introduces the joint energy-force negative log-likelihood (NLL$_\text{JEF}$) loss function, which explicitly models uncertainty in both energies and interatomic forces. We demonstrate that NLL$_\text{JEF}$ facilitates efficient active learning by quantifying energy and force uncertainties.
arXiv Detail & Related papers (2025-10-03T14:28:10Z) - Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects [0.6087817758152709]
We present a systematic study of personality control using the Big Five traits. Trait-level analysis shows openness as uniquely challenging, and agreeableness as most resistant to ICL. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs.
arXiv Detail & Related papers (2025-09-05T04:19:15Z) - Post-hoc Probabilistic Vision-Language Models [54.05237186168399]
Vision-language models (VLMs) have found remarkable success in classification, retrieval, and generative tasks. We propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our results show promise for safety-critical applications of large-scale models.
arXiv Detail & Related papers (2024-12-08T18:16:13Z) - FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation [73.454943870226]
Language models have shown impressive in-context-learning capabilities.
We propose a measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation.
arXiv Detail & Related papers (2024-06-17T06:14:55Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
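The core mechanism described above, per-sample label smoothing scaled by uncertainty, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and the linear scaling are assumptions:

```python
import numpy as np

def ual_smoothed_targets(targets, n_classes, uncertainty, max_smoothing=0.2):
    """Per-sample label smoothing: the smoothing strength grows with an
    uncertainty score in [0, 1], so uncertain samples get softer targets."""
    eps = max_smoothing * np.asarray(uncertainty, dtype=float)  # (batch,)
    one_hot = np.eye(n_classes)[np.asarray(targets)]            # (batch, n_classes)
    return one_hot * (1.0 - eps[:, None]) + eps[:, None] / n_classes

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against the smoothed targets (stabilized log-softmax)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(soft_targets * log_probs).sum(axis=-1).mean()
```

With `uncertainty=0` this reduces to ordinary one-hot cross-entropy; with higher uncertainty the target mass spreads toward the uniform distribution, down-weighting confidently wrong gradients on noisy samples.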
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Exploring the Performance of Continuous-Time Dynamic Link Prediction Algorithms [14.82820088479196]
Dynamic Link Prediction (DLP) addresses the prediction of future links in evolving networks.
In this work, we contribute tools to perform such a comprehensive evaluation.
We describe an exhaustive taxonomy of negative sampling methods that can be used at evaluation time.
arXiv Detail & Related papers (2024-05-27T14:03:28Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on GSM8K, AQuA, and StrategyQA, respectively.
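The decoding scheme sketched above keeps a beam of partial reasoning chains and ranks candidates by a weighted sum of the generation log-probability and a self-evaluation confidence score. A minimal generic sketch (the interfaces `step_fn`/`eval_fn` and the weight `alpha` are hypothetical placeholders, not the paper's API):

```python
import heapq

def self_eval_beam_search(step_fn, eval_fn, init_state,
                          beam_width=3, depth=3, alpha=0.5):
    """Beam search guided by self-evaluation.

    step_fn(state) -> iterable of (next_state, generation_logprob)
    eval_fn(state) -> self-evaluation confidence score for the step
    Candidates are ranked by a weighted sum of both signals.
    """
    beams = [(0.0, init_state)]
    for _ in range(depth):
        candidates = []
        for score, state in beams:
            for nxt, gen_lp in step_fn(state):
                total = score + (1.0 - alpha) * gen_lp + alpha * eval_fn(nxt)
                candidates.append((total, nxt))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams
```

Setting `alpha=0` recovers ordinary likelihood-ranked beam search; raising it lets the self-evaluator veto locally fluent but logically weak steps.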
arXiv Detail & Related papers (2023-05-01T02:37:59Z) - Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders [9.401989343015364]
We study robust policy evaluation and policy optimization in the presence of sequentially-exogenous unobserved confounders. We provide sample complexity bounds and insights, and show effectiveness both in simulations and on real-world longitudinal healthcare data for treating sepsis.
arXiv Detail & Related papers (2023-02-01T18:40:53Z) - Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
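The transductive refinement described in the last entry, updating each class prototype with confidence-weighted unlabeled queries, can be sketched as below. This is an illustrative reconstruction (here confidence comes from a softmax over negative Euclidean distances, and the weighting scheme is an assumption, not the paper's meta-learned version):

```python
import numpy as np

def refine_prototypes(prototypes, queries, temperature=1.0):
    """Confidence-weighted transductive prototype refinement.

    Each unlabeled query contributes to every class prototype in
    proportion to its softmax confidence for that class.
    prototypes: (C, D) support-set class means; queries: (Q, D).
    """
    # Squared Euclidean distances, shape (Q, C), via broadcasting.
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -d / temperature
    conf = np.exp(logits - logits.max(axis=1, keepdims=True))
    conf /= conf.sum(axis=1, keepdims=True)       # soft class assignments
    # New prototype: support prototype plus confidence-weighted query mass.
    num = prototypes + conf.T @ queries
    den = 1.0 + conf.sum(axis=0)[:, None]
    return num / den
```

A query near one prototype pulls that prototype toward itself while leaving distant prototypes essentially untouched; meta-learning the confidence (as the paper proposes) replaces the fixed softmax weighting with learned per-query weights.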
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.