Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2507.07424v1
- Date: Thu, 10 Jul 2025 04:31:56 GMT
- Title: Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
- Authors: Jingjing Jiang, Chao Ma, Xurui Song, Hanwang Zhang, Jun Luo
- Abstract summary: We present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. To enhance Corvid's CoT reasoning, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset. We also propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning.
- Score: 51.867949053263466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.
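The abstract describes GateMixer only at a high level. As a reading aid, here is a minimal sketch of a gated cross-modal connector in that spirit; the module structure, gating form, and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical gated connector in the spirit of GateMixer; structure and
# dimensions are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class GatedConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # map visual tokens into the LLM embedding space
        self.gate = nn.Sequential(                  # per-token, per-channel gate in [0, 1]
            nn.Linear(vision_dim, llm_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim)
        return self.gate(visual_tokens) * self.proj(visual_tokens)

connector = GatedConnector()
aligned = connector(torch.randn(2, 256, 1024))      # -> (2, 256, 4096)
```

The learned gate decides, per visual token, how much signal flows into the LLM's embedding space, which is one plausible way a connector could facilitate cross-modal alignment.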
Related papers
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought [11.538345159297839]
Chain-of-thought (CoT) prompting has been adapted for large vision-language models (LVLMs) to enhance multi-modal reasoning. However, existing LVLMs often ignore the contents of generated rationales during CoT reasoning. We propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy.
arXiv Detail & Related papers (2025-07-10T12:07:13Z)
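The summary does not spell out how decoding is conditioned on the rationale; a guidance-style combination of the two next-token distributions is one plausible, hedged reading, not the paper's formula:

```python
# Hedged sketch of rationale-conditioned decoding in the spirit of RED;
# the guidance-style combination rule is an assumption, not the paper's formula.
import torch

def rationale_enhanced_logits(
    logits_with_rationale: torch.Tensor,  # next-token logits given (image, question, rationale)
    logits_without: torch.Tensor,         # next-token logits given (image, question) only
    alpha: float = 1.5,
) -> torch.Tensor:
    # Upweight tokens whose likelihood rises when the rationale is in context,
    # so the final answer actually depends on the generated reasoning.
    return logits_without + alpha * (logits_with_rationale - logits_without)
```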
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
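A composite reward of this kind could, under assumptions, look like the toy function below; the weighting and the object-matching heuristic are illustrative, not the paper's reward.

```python
# Toy composite reward in the spirit of a "visual perception reward";
# the weighting and substring matching are illustrative assumptions.
def perception_reward(
    predicted_rationale: str,
    reference_objects: set[str],  # ground-truth objects the model should mention (lowercase)
    answer_correct: bool,
    beta: float = 0.5,
) -> float:
    mentioned = {o for o in reference_objects if o in predicted_rationale.lower()}
    perception_score = len(mentioned) / max(len(reference_objects), 1)
    # Correct answers dominate; accurate perception earns a bonus.
    return float(answer_correct) + beta * perception_score
```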
- VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning [33.170426237654596]
VIDEORFT is a novel approach to cultivating human-like video reasoning capabilities in MLLMs. It follows the standard two-stage scheme of RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. It achieves state-of-the-art performance on six video reasoning benchmarks.
arXiv Detail & Related papers (2025-05-18T14:14:35Z)
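The SFT stage of such a two-stage recipe typically trains only on the CoT-plus-answer tokens. Below is a generic token-masked loss as a sketch of that standard recipe, not VideoRFT's code.

```python
# Generic token-masked SFT loss for the CoT stage of an RFT recipe;
# a standard sketch, not VideoRFT's code.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    labels = labels.clone()
    labels[:, :prompt_len] = -100                  # ignore the (video, question) prompt tokens
    return F.cross_entropy(                        # shift so position t predicts token t+1
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```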
- Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning [3.364797975300393]
We present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks.
arXiv Detail & Related papers (2025-05-18T14:08:03Z)
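Difficulty-ordered sampling of the kind the NeuraLadder description suggests could be sketched as follows; the `difficulty` field and the widening schedule are assumptions for illustration.

```python
# Illustrative easy-to-hard sampler for a difficulty-organized dataset;
# the "difficulty" field and the widening schedule are assumptions.
import random

def curriculum_batches(samples: list[dict], num_steps: int, batch_size: int):
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    for step in range(num_steps):
        frac = (step + 1) / num_steps              # widen the pool so harder samples enter over time
        pool = ordered[: max(batch_size, int(frac * len(ordered)))]
        yield random.sample(pool, min(batch_size, len(pool)))
```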
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format-tuning stage to internalize the chain-of-action-thought (COAT) reasoning format, and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
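A format-tuning stage needs training strings that interleave meta-actions with reasoning text. The sketch below assumes hypothetical action tokens, not necessarily Satori's exact vocabulary.

```python
# Hypothetical COAT trajectory formatting for the format-tuning stage;
# the meta-action tokens are assumed for illustration, not Satori's exact vocabulary.
def format_coat(steps: list[tuple[str, str]], answer: str) -> str:
    tokens = {"continue": "<|continue|>", "reflect": "<|reflect|>", "explore": "<|explore|>"}
    body = "\n".join(f"{tokens[action]} {text}" for action, text in steps)
    return f"{body}\nAnswer: {answer}"

example = format_coat(
    [("continue", "Compute 17 * 3 = 51."), ("reflect", "Check: 51 / 3 = 17, consistent.")],
    "51",
)
```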
- Mentor-KD: Making Small Language Models Better Multi-step Reasoners [15.159415340059388]
We propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs.
We exploit a mentor, an intermediate-sized task-specific fine-tuned model, to generate additional CoT annotations.
We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
arXiv Detail & Related papers (2024-10-11T17:53:27Z)
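Distillation from a mentor usually combines hard targets (the mentor's CoT annotations) with soft-label matching. Here is a minimal sketch under that common assumption, not Mentor-KD's exact objective.

```python
# Minimal distillation objective under the usual hard/soft-target assumption;
# the mixing weight and temperature are illustrative, not Mentor-KD's exact loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, mentor_logits, labels, temperature=2.0, lam=0.5):
    # Hard targets: cross-entropy on the mentor-written CoT tokens.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
    # Soft targets: KL between softened mentor and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(mentor_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return lam * ce + (1 - lam) * kl
```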
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
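The "simply concatenating visual tokens" finding can be illustrated with a channel-wise concatenation followed by a projection; the dimensions and spatial-alignment assumption below are illustrative.

```python
# Sketch of channel-wise concatenation of tokens from complementary encoders;
# dimensions and the spatial-alignment assumption are illustrative.
import torch
import torch.nn as nn

class MixedEncoderFusion(nn.Module):
    def __init__(self, dims=(1024, 768), llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(sum(dims), llm_dim)

    def forward(self, token_sets: list[torch.Tensor]) -> torch.Tensor:
        # One (batch, num_tokens, dim_i) tensor per encoder, with token grids
        # resampled to the same length so channels can simply be concatenated.
        return self.proj(torch.cat(token_sets, dim=-1))

fusion = MixedEncoderFusion()
out = fusion([torch.randn(2, 576, 1024), torch.randn(2, 576, 768)])  # -> (2, 576, 4096)
```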
- Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales [102.54274021830207]
We introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs.
We filter rationales, keeping those that transfer from the programming paradigm to end-to-end paradigms, to guarantee transferability. Our approach also reduces hallucinations owing to the high correlation it maintains between images and text.
arXiv Detail & Related papers (2024-04-17T07:20:56Z)
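One hedged reading of the transferability filter: keep a program-derived rationale only if an end-to-end model conditioned on it still reproduces the gold answer. The `answer_fn` callable below is a stand-in, not the paper's interface.

```python
# Hedged sketch of a transferability filter; `answer_fn` is a stand-in for
# an end-to-end MLLM call, not the paper's interface.
from typing import Callable, Iterable

def filter_transferable(
    rationales: Iterable[str],
    answer_fn: Callable[[str], str],  # rationale -> end-to-end model's predicted answer
    gold_answer: str,
) -> list[str]:
    # Keep a program-derived rationale only if conditioning the end-to-end
    # model on it still yields the correct answer.
    return [r for r in rationales if answer_fn(r).strip() == gold_answer.strip()]
```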
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs). However, existing CoT approaches usually focus on simpler reasoning tasks and thus produce low-quality and inconsistent CoT prompts. We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
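An evolve-and-filter loop is one plausible shape for automatic CoT prompt improvement; the strategy names and the `generate`/`score` callables below are assumptions, not CoTGenius's actual interface.

```python
# Illustrative evolve-and-filter loop for CoT prompt improvement; strategy
# names and the `generate`/`score` callables are assumptions, not CoTGenius's API.
from typing import Callable

def evolve_cot_prompt(
    seed_prompt: str,
    generate: Callable[[str, str], str],  # (prompt, strategy) -> evolved candidate
    score: Callable[[str], float],        # candidate quality on held-out questions
    strategies=("complicate", "diversify", "specify"),
    rounds: int = 3,
) -> str:
    best = seed_prompt
    for _ in range(rounds):
        candidates = [generate(best, s) for s in strategies]
        best = max(candidates + [best], key=score)  # keep a candidate only if it improves
    return best
```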
- Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE [83.00018517368973]
Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning. However, conflicts and interference among tasks can harm performance. We combine the well-known Mixture-of-Experts (MoE) technique with LoRA, a representative PEFT method, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning.
arXiv Detail & Related papers (2023-11-05T15:48:29Z)
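A LoRA-MoE layer plausibly routes each input to one of several low-rank adapters on top of a frozen base projection; the rank, expert count, and hard top-1 per-sequence routing below are assumptions for illustration.

```python
# Hypothetical LoRA-MoE layer: a router picks one low-rank adapter per
# sequence on top of a frozen base projection. Rank, expert count, and
# hard top-1 routing are assumptions for illustration.
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    def __init__(self, dim: int = 512, rank: int = 8, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)   # stands in for a frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.router = nn.Linear(dim, num_experts)
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(rank, dim, bias=False) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); route each sequence to a single expert
        # so that tasks with conflicting gradients use separate adapters.
        experts = self.router(x.mean(dim=1)).argmax(dim=-1)
        delta = torch.stack(
            [self.up[int(e)](self.down[int(e)](xi)) for xi, e in zip(x, experts)]
        )
        return self.base(x) + delta

layer = LoRAMoE()
out = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```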