Related papers: Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

URL: http://arxiv.org/abs/2512.22545v1
Date: Sat, 27 Dec 2025 10:14:14 GMT
Title: Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
Authors: Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang,
Abstract summary: Multimodal LLMs produce fluent yet unreliable reasoning.<n>We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning.<n> SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks.
Score: 16.357026482329232
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.

Related papers

Recursive Think-Answer Process for LLMs and VLMs [54.52289112197118]
We propose an efficient Recursive Think-Answer Process (R-TAP)<n>R-TAP enables models to engage in iterative reasoning cycles and generate more accurate answers.<n>We show that R-TAP-enhanced models consistently outperform conventional single-pass methods.
arXiv Detail & Related papers (2026-03-02T17:20:10Z)
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction [55.04308051033549]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning.<n>We introduceVerifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers.
arXiv Detail & Related papers (2026-02-13T03:40:52Z)
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization.<n>We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains.<n>Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
arXiv Detail & Related papers (2025-12-19T12:56:17Z)
SSR: Socratic Self-Refine for Large Language Model Reasoning [78.62319252287938]
Socratic Self-Refine (SSR) is a novel framework for fine-grained evaluation and precise refinement of Large Language Models (LLMs)<n>Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation.<n> Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines.
arXiv Detail & Related papers (2025-11-13T18:47:07Z)
RaCoT: Plug-and-Play Contrastive Example Generation Mechanism for Enhanced LLM Reasoning Reliability [12.67288560758937]
We propose RaCoT (Retrieval-aware Contrastive-of-Thought), a novel framework that shifts contrastive thinking to the pre-retrieval stage.<n>RaCoT guides the model to proactively focus on the critical details that determine answer divergence"
arXiv Detail & Related papers (2025-10-26T15:06:44Z)
Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs)<n>We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z)
Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models [0.026861992804651083]
This paper introduces a robust framework for identifying the most trustworthy SR sample from a diffusion-generated set.<n>We propose a novel Trustworthiness Score (TWS) a hybrid metric that quantifies SR reliability based on semantic similarity.<n>By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.
arXiv Detail & Related papers (2025-06-25T21:00:44Z)
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs.<n>Existing fine-tuning-based compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection.<n>We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.