Variational Reasoning for Language Models
- URL: http://arxiv.org/abs/2509.22637v2
- Date: Wed, 15 Oct 2025 14:08:12 GMT
- Title: Variational Reasoning for Language Models
- Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
- Abstract summary: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables. We show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives.
- Score: 93.08197299751197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
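The abstract's step from the ELBO to a tighter multi-trace bound can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes an importance-weighted (IWAE-style) form for the multi-trace bound, which may differ from the objective the authors actually use, and the three-trace distribution `joint` and posterior `q` are made-up numbers for illustration only.

```python
import itertools
import math

# Hypothetical toy setup: one fixed question x and answer y, and three
# possible thinking traces z. All numbers are made up for illustration.
joint = [0.10, 0.25, 0.05]   # p(z, y | x); sums to the evidence p(y | x) = 0.40
q = [0.5, 0.3, 0.2]          # variational posterior q(z | x, y)

log_evidence = math.log(sum(joint))


def multi_trace_bound(k):
    """Exact E_{z_1..z_k ~ q}[ log (1/k) * sum_i p(z_i, y | x) / q(z_i | x, y) ].

    With k = 1 this is the standard ELBO. Larger k gives an
    importance-weighted bound (an assumed form, see the note above).
    """
    total = 0.0
    # Enumerate every k-tuple of traces instead of sampling, so the
    # expectation is exact for this tiny latent space.
    for zs in itertools.product(range(len(q)), repeat=k):
        prob = math.prod(q[z] for z in zs)
        avg_weight = sum(joint[z] / q[z] for z in zs) / k
        total += prob * math.log(avg_weight)
    return total


elbo = multi_trace_bound(1)
bound_k4 = multi_trace_bound(4)

print(f"ELBO (1 trace):   {elbo:.4f}")
print(f"Bound (4 traces): {bound_k4:.4f}")
print(f"log p(y | x):     {log_evidence:.4f}")
```

In this toy case the single-trace ELBO sits below the four-trace bound, which in turn sits below the log evidence, matching the "tighter bounds" claim under the assumed form.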
Related papers
- Learning Structured Reasoning via Tractable Trajectory Control [99.75278337895024]
Ctrl-R is a framework for learning structured reasoning via tractable trajectory control. We show that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns.
arXiv Detail & Related papers (2026-03-02T09:18:19Z) - Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs [49.66344956133349]
Reasoning capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a latent variable for strategic contextualization.
arXiv Detail & Related papers (2025-12-19T03:32:53Z) - Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning [8.01259760303241]
We investigate whether a partially completed reasoning chain can be reliably continued by another model. We use token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models. Our findings point towards interchangeability as an emerging behavioral property of reasoning models.
arXiv Detail & Related papers (2025-12-16T02:56:44Z) - Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z) - CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z) - Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model [12.063078727764045]
We argue that alignment via reinforcement learning from human feedback lacks theoretical justification and incentivizes deterministic solutions. We propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performance of RLHF and DPO.
arXiv Detail & Related papers (2025-06-02T10:36:31Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation [1.2576388595811496]
We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language. We permute reasoning problems written in real languages to generate numerous question variations. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge.
arXiv Detail & Related papers (2025-03-04T19:57:47Z) - Reparameterized Variational Rejection Sampling [12.189621777178354]
Variational Rejection Sampling (VRS) combines a parametric proposal distribution with rejection sampling to define a rich non-parametric family of distributions.
We show that our method performs well in practice and that it is well-suited for black-box inference, especially for models with local latent variables.
arXiv Detail & Related papers (2023-09-26T01:46:53Z) - Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z) - Variational Causal Networks: Approximate Bayesian Inference over Causal Structures [132.74509389517203]
We introduce a parametric variational family modelled by an autoregressive distribution over the space of discrete DAGs.
In experiments, we demonstrate that the proposed variational posterior is able to provide a good approximation of the true posterior.
arXiv Detail & Related papers (2021-06-14T17:52:49Z) - Decision-Making with Auto-Encoding Variational Bayes [71.44735417472043]
We show that a posterior approximation distinct from the variational distribution should be used for making decisions.
Motivated by these theoretical results, we propose learning several approximate proposals for the best model.
In addition to toy examples, we present a full-fledged case study of single-cell RNA sequencing.
arXiv Detail & Related papers (2020-02-17T19:23:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.