Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
- URL: http://arxiv.org/abs/2509.03321v2
- Date: Thu, 09 Oct 2025 03:53:53 GMT
- Title: Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
- Authors: Linyu Ou, YuYang Yin
- Abstract summary: This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of lightweight multimodal language models (MLLMs). Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning.
- Score: 2.102184971295473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.
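To make the two-stage recipe concrete, the sketch below illustrates, under assumed conventions, what a long-CoT SFT target and a verifiable reward for the subsequent RL stage might look like. The `<think>` / `Final Answer:` markup, the exact-match reward, and the function names are illustrative assumptions; the abstract does not specify the paper's actual data template or reward definition.

```python
# Minimal sketch of the two-stage recipe: (1) SFT on long chain-of-thought
# traces, (2) an RL stage scored by a verifiable (rule-based) reward.
# The template and reward below are assumptions for illustration only.

import re
from typing import Optional


def format_long_cot_sft_example(question: str, cot_trace: str, answer: str) -> dict:
    """Pack one QA item into a (prompt, target) pair for SFT.

    The target contains the full reasoning trace followed by a delimited final
    answer, so the model is supervised on the long CoT, not just the answer.
    """
    prompt = f"Question: {question}\nThink step by step, then give the final answer."
    target = f"<think>\n{cot_trace}\n</think>\nFinal Answer: {answer}"
    return {"prompt": prompt, "target": target}


def extract_final_answer(completion: str) -> Optional[str]:
    """Pull the text after the 'Final Answer:' marker, if present."""
    match = re.search(r"Final Answer:\s*(.+)", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward for the RL stage: 1.0 iff the extracted final
    answer exactly matches the gold answer after light normalization."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if _normalize(predicted) == _normalize(gold_answer) else 0.0


if __name__ == "__main__":
    example = format_long_cot_sft_example(
        question="How many triangles are in the figure?",
        cot_trace="Count the small triangles (4), then the larger composite ones (2).",
        answer="6",
    )
    print(example["target"])
    rollout = "<think>4 small + 2 large = 6</think>\nFinal Answer: 6"
    print(verifiable_reward(rollout, "6"))  # 1.0
```

In practice each prompt would also carry the associated image, and both stages would run with a standard SFT/RL trainer; the sketch only captures the shape of the supervision signal at each stage.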
Related papers
- L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention [47.82350055363378]
Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs). Existing approaches either incur high training costs or require architectural alignment. We propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs.
arXiv Detail & Related papers (2025-11-22T04:25:25Z)
- RL makes MLLMs see better than SFT [96.508432109136]
We conduct a critical yet under-explored analysis of the vision encoder of Multimodal Language Models (MLLMs). Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT).
arXiv Detail & Related papers (2025-10-18T03:37:17Z)
- NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints [100.02131897927484]
This paper focuses on the native training of Multimodal Large Language Models (MLLMs) in an end-to-end manner. We propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs.
arXiv Detail & Related papers (2025-10-09T17:59:37Z)
- What Factors Affect LLMs and RLLMs in Financial Question Answering? [4.42417272193095]
This study explores the impact of various methods on large language models (LLMs) and reasoning large language models (RLLMs) in the financial domain. We utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks.
arXiv Detail & Related papers (2025-07-11T06:37:44Z)
- LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs [63.580867975515474]
We present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. (A generic sketch of NTK-aware RoPE scaling appears after this list.)
arXiv Detail & Related papers (2025-06-17T11:45:37Z)
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from large MLLMs (l-MLLMs) to small MLLMs (s-MLLMs). We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- 60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering [50.12622877002846]
Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, and these models can then be fine-tuned for the question-answering (QA) task. We categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs. Our experiments show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task.
arXiv Detail & Related papers (2024-09-24T07:38:38Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
However, LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression [40.4998607679863]
Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data.
This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps.
arXiv Detail & Related papers (2024-06-17T09:17:40Z)
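The LongLLaDA entry above mentions NTK-based RoPE extrapolation. Its exact integration with LLaDA is not described in the snippet, so the sketch below shows only the generic NTK-aware trick it builds on: inflating the RoPE base so that low-frequency rotary components stretch to longer contexts without retraining. The head dimension and scale factor are illustrative values, not LongLLaDA's settings.

```python
# Generic NTK-aware RoPE base rescaling (not LongLLaDA's specific method):
# to target a context `scale` times longer, the rotary base theta is inflated
# as theta' = theta * scale ** (dim / (dim - 2)), which slows the lowest
# frequencies by ~scale while leaving the highest frequencies almost unchanged.


def rope_inverse_frequencies(dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies 1 / base^(2i/dim) for i = 0..dim/2-1."""
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]


def ntk_scaled_base(dim: int, scale: float, base: float = 10000.0) -> float:
    """NTK-aware base rescaling for a context `scale` times longer than training."""
    return base * scale ** (dim / (dim - 2))


if __name__ == "__main__":
    head_dim = 128   # illustrative head dimension
    scale = 4.0      # e.g. 4k-token training context -> 16k-token target
    original = rope_inverse_frequencies(head_dim)
    extended = rope_inverse_frequencies(head_dim, ntk_scaled_base(head_dim, scale))
    print(original[-1] / extended[-1])  # lowest frequency slowed by exactly `scale`
    print(original[0] / extended[0])    # highest frequency (i = 0) unchanged: 1.0
```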
This list is automatically generated from the titles and abstracts of the papers on this site.