Related papers: Improve Vision Language Model Chain-of-thought Reasoning

Related papers

Beyond Correctness: Learning Robust Reasoning via Transfer [51.403609251508904]
We adopt a simple philosophical view, robust reasoning should remain useful beyond the mind that produced it.<n>We introduce Reinforcement Learning with Transferable Reward, which operationalizes robustness via transfer reward.<n>Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps.
arXiv Detail & Related papers (2026-02-09T10:41:44Z)
When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks.<n>We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
arXiv Detail & Related papers (2025-12-12T04:44:40Z)
See, Think, Learn: A Self-Taught Multimodal Reasoner [3.443084677278651]
We propose a simple yet effective self-training framework called See-Think-Learn.<n>At its core, STL introduces a structured reasoning template that encourages the model to see before thinking.<n>We augment the training data with negative rationales to enhance the model's ability to distinguish between correct and misleading responses.
arXiv Detail & Related papers (2025-12-02T06:30:10Z)
HoneyBee: Data Recipes for Vision-Language Reasoners [90.83745691506329]
We introduce several data curation approaches and study their impacts on vision-language models (VLMs)<n>We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions.<n>Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs.
arXiv Detail & Related papers (2025-10-14T07:23:44Z)
Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation [25.195244084313114]
Chain-of-Thought Correctness Perception Distillation (CoPeD) aims to improve the reasoning quality of the student model.<n>CoPeD encourages the student model to predict answers based on correct rationales and revise them when they are incorrect.
arXiv Detail & Related papers (2025-09-06T05:33:17Z)
Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning [0.42855555838080844]
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought prompting and reinforcement learning.<n>We find that simple CoT formats, where the model generates a reasoning step before the answer, can harm the model's original performance.<n>In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy.
arXiv Detail & Related papers (2025-07-06T10:51:12Z)
VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training [23.391643634478587]
Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback.<n> bootstrapping dilemma arises as high-quality training data depends on already strong VL models.<n>We propose an iterative training framework leveraging vision experts, Chain-of-Thought rationales, and Margin-based Rejection Sampling.
arXiv Detail & Related papers (2025-06-16T18:10:51Z)
Interleaved Reasoning for Large Language Models via Reinforcement Learning [22.403928213802036]
Long chain-of-thought (CoT) enhances large language models' (LLM) reasoning capabilities.<n>We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions.
arXiv Detail & Related papers (2025-05-26T07:58:17Z)
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning [54.65050470296886]
We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps.<n>We demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets.<n>Our approach effectively enhances distilled data and provides better starting points for reinforcement learning.
arXiv Detail & Related papers (2025-05-20T17:59:31Z)
Phi-4-reasoning Technical Report [42.508165017775]
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. We develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning. Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.
arXiv Detail & Related papers (2025-04-30T05:05:09Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT) Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. OpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner [2.779063752888881]
Self-taught reasoner (STaR) framework uses reinforcement learning to automatically generate reasoning steps. STaR and its variants have demonstrated empirical success, but a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR.
arXiv Detail & Related papers (2024-10-31T13:17:53Z)
Vision-Language Models Can Self-Improve Reasoning via Reflection [20.196406628954303]
Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs) We propose a self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales. Our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
arXiv Detail & Related papers (2024-10-30T14:45:00Z)
Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
textitSelf-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets. We train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning.
arXiv Detail & Related papers (2024-04-10T14:05:44Z)
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency. We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
Entailment as Robust Self-Learner [14.86757876218415]
We design a prompting strategy that formulates a number of different NLU tasks as contextual entailment. We propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better pseudo-labeling quality in self-training.
arXiv Detail & Related papers (2023-05-26T18:41:23Z)
SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting. We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z)
Harnessing the Power of Explanations for Incremental Training: A LIME-Based Approach [6.244905619201076]
In this work, model explanations are fed back to the feed-forward training to help the model generalize better. The framework incorporates the custom weighted loss with Elastic Weight Consolidation (EWC) to maintain performance in sequential testing sets. The proposed custom training procedure results in a consistent enhancement of accuracy ranging from 0.5% to 1.5% throughout all phases of the incremental learning setup.
arXiv Detail & Related papers (2022-11-02T18:16:17Z)
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics. Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets. Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.