The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
- URL: http://arxiv.org/abs/2601.00747v1
- Date: Fri, 02 Jan 2026 17:10:31 GMT
- Title: The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
- Authors: Max Ruiz Luyten, Mihaela van der Schaar,
- Abstract summary: State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops.<n>We analyze how this design choice is sensitive to the collapse of the model's distribution over reasoning paths.<n>We introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces.
- Score: 57.652356955571065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: sampling diverse chains of thought and reinforcing the highest-scoring ones, mainly optimizing correctness. We analyze how this design choice is sensitive to the collapse of the model's distribution over reasoning paths, slashing semantic entropy and undermining creative problem-solving. To analyze this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, and DPO, as well as entropy bonuses, and other methods, all constitute special cases of the same loss. The framework delivers three core results: (i) the diversity decay theorem, describing how correctness-based objectives lead to distinct modes of diversity decay for STaR, GRPO, and DPO; (ii) designs that ensure convergence to a stable and diverse policy, effectively preventing collapse; and (iii) simple, actionable recipes to achieve this in practice. DCR thus offers the first principled recipe for LLMs that remain both correct and creative.
Related papers
- Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning [31.629261193485053]
Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks.<n>Most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model's intrinsic uncertainty.<n>We propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs.
arXiv Detail & Related papers (2026-02-26T08:40:06Z) - DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation [20.756497463882763]
We propose DiffuReason, a unified "Think-then-Diffuse" framework for sequential recommendation.<n>It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization.<n>Experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures.
arXiv Detail & Related papers (2026-02-10T12:55:30Z) - On the Plasticity and Stability for Post-Training Large Language Models [54.757672540381236]
We identify a root cause as the conflict between plasticity and stability gradients.<n>We propose Probabilistic Conflict Resolution (PCR), a framework that models gradients as random variables.<n>PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.
arXiv Detail & Related papers (2026-02-06T07:31:26Z) - Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs)<n>We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths.<n>We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
arXiv Detail & Related papers (2026-02-05T04:06:55Z) - SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set level diversity objective defined over sampled trajectories using kernelized similarity.<n>Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization.<n>Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z) - Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization [38.469173375694076]
This paper systematically analyzes the root causes of hallucinations in Multimodal Large Language Models (MLLMs)<n>It identifies three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where NTK similarity causes false associations and unstable parameter updates.<n> Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
arXiv Detail & Related papers (2026-01-09T07:59:18Z) - Deconstructing Generative Diversity: An Information Bottleneck Analysis of Discrete Latent Generative Models [4.138804085040435]
Generative diversity varies significantly across discrete latent generative models such as AR, MIM, and Diffusion.<n>We propose a diagnostic framework, grounded in Information Bottleneck (IB) theory, to analyze the underlying strategies resolving this behavior.
arXiv Detail & Related papers (2025-12-01T16:13:23Z) - Multi-Path Collaborative Reasoning via Reinforcement Learning [54.8518809800168]
Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs)<n>Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space.<n>We propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process.
arXiv Detail & Related papers (2025-12-01T10:05:46Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Generative Reasoning Recommendation via LLMs [48.45009951684554]
Large language models (LLMs) face fundamental challenges in functioning as generative reasoning recommendation models (GRRMs)<n>This work explores how to build GRRMs by adapting pre-trained LLMs, which achieves a unified understanding-reasoning-prediction manner for recommendation tasks.<n>We propose GREAM, an end-to-end framework that integrates three components: Collaborative-Semantic Alignment, Reasoning Curriculum Activation, and Sparse-Regularized Group Policy Optimization.
arXiv Detail & Related papers (2025-10-23T17:59:31Z) - Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation.<n>Current reinforcement learning approaches often rely on sparse, outcome-based rewards.<n>We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z) - SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion [0.8594140167290097]
SPREAD is a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs)<n>It learns a conditional diffusion process over points sampled from the decision space.<n>It refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence.
arXiv Detail & Related papers (2025-09-25T12:09:37Z) - Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning [31.727984223052648]
This paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model.<n>We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o.<n>We then prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks.
arXiv Detail & Related papers (2025-05-06T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.