L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
- URL: http://arxiv.org/abs/2511.17910v1
- Date: Sat, 22 Nov 2025 04:25:25 GMT
- Title: L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
- Authors: Yuliang Zhan, Xinyu Tang, Han Wan, Jian Li, Ji-Rong Wen, Hao Sun
- Abstract summary: Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs). Existing approaches either incur high training costs or require architectural alignment. We propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs.
- Score: 47.82350055363378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision-Language Models (VLMs) still struggle with multi-step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either incur high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V-CoT extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.
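The abstract describes a concrete three-step pipeline: extract a CoT representation from an LLM, keep only its low-frequency spectrum and resample it to the VLM's hidden size, then add it to VLM hidden states at inference. A minimal sketch of that pipeline is below. The function names, the difference-of-means extraction standing in for a LAT-style probe, the frequency cutoff `keep`, and the injection strength `alpha` are all illustrative assumptions, not the authors' released implementation.

```python
import torch

def extract_cot_direction(acts_cot: torch.Tensor, acts_plain: torch.Tensor) -> torch.Tensor:
    """Crude stand-in for a LAT-style probe: a difference-of-means direction
    between hidden states of CoT-prompted and plain completions.
    Shapes: (num_samples, d_llm) -> (d_llm,)."""
    return acts_cot.mean(dim=0) - acts_plain.mean(dim=0)

def resample_low_freq(v_llm: torch.Tensor, d_vlm: int, keep: int = 64) -> torch.Tensor:
    """Frequency-domain dimension matching: keep only the lowest `keep`
    spectral components of the vector, then invert the FFT at the target
    width, so a d_llm-dim vector becomes a d_vlm-dim vector."""
    spec = torch.fft.rfft(v_llm)
    spec[keep:] = 0                                   # drop high frequencies
    # irfft at the new length; rescale to compensate for the 1/n normalization.
    return torch.fft.irfft(spec, n=d_vlm) * (d_vlm / v_llm.numel())

def inject(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Training-free latent intervention: add the transferred CoT direction
    to one layer's hidden states during VLM inference."""
    return hidden + alpha * direction

# Toy shapes: a 4096-dim LLM direction steering a 3584-dim VLM layer.
v_llm = extract_cot_direction(torch.randn(256, 4096), torch.randn(256, 4096))
v_vlm = resample_low_freq(v_llm, d_vlm=3584)
steered = inject(torch.randn(1, 128, 3584), v_vlm)   # (batch, seq, d_vlm)
```

In practice such an intervention would typically be installed as a forward hook on selected VLM decoder layers; which layers to steer and at what strength would need tuning per model pair.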
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs [63.580867975515474]
We present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation (a brief sketch of this RoPE trick appears after this list).
arXiv Detail & Related papers (2025-06-17T11:45:37Z)
- Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models [19.361686225381447]
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL). We propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer.
arXiv Detail & Related papers (2025-06-09T16:55:32Z)
- Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based LLMs in reasoning. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
- Misaligning Reasoning with Answers -- A Framework for Assessing LLM CoT Robustness [3.9930400744726273]
We design a novel evaluation framework, MATCHA, to investigate the relationship between answers and reasoning. In domains like education and healthcare, reasoning is key to model trustworthiness. Our results show that LLMs exhibit greater vulnerability to input perturbations on multi-step and commonsense tasks than on logical tasks.
arXiv Detail & Related papers (2025-05-23T02:42:16Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format-tuning stage to internalize the Chain-of-Action-Thought (COAT) reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
- On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models [25.029579061612456]
Large Language Models (LLMs) are increasingly being employed in real-world applications in critical domains such as healthcare.
It is important to ensure that the Chain-of-Thought (CoT) reasoning generated by these models faithfully captures their underlying behavior.
arXiv Detail & Related papers (2024-06-15T13:16:44Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
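For the LongLLaDA entry above: NTK-based RoPE extrapolation is a known training-free trick that raises the rotary base so low-frequency position components stretch to longer contexts while high-frequency ones are nearly unchanged. A minimal sketch follows, independent of LLaDA's own code; the `scale` value is an assumption.

```python
import torch

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0, scale: float = 4.0) -> torch.Tensor:
    """NTK-aware RoPE extrapolation: multiply the rotary base by
    scale ** (dim / (dim - 2)) so the position embedding covers a
    roughly `scale`-times longer context without retraining."""
    base = base * scale ** (dim / (dim - 2))
    return 1.0 / base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)

inv_freq = ntk_scaled_inv_freq(dim=128)  # per-head dimension typical of 7B models
```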