Related papers: CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

URL: http://arxiv.org/abs/2601.08010v1
Date: Mon, 12 Jan 2026 21:24:45 GMT
Title: CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation
Authors: Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli,
Abstract summary: We introduce two complementary approaches inspired by test-time scaling to stabilize vision-language models.<n>CASHEW is an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces.<n>CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence.
Score: 6.356820150960838
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.

Related papers

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs [24.90876091319589]
We present an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning.<n>Our key idea is to supervise each reasoning step at test time with visual evidence.<n>Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench.
arXiv Detail & Related papers (2026-02-25T02:13:59Z)
Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization [38.469173375694076]
This paper systematically analyzes the root causes of hallucinations in Multimodal Large Language Models (MLLMs)<n>It identifies three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where NTK similarity causes false associations and unstable parameter updates.<n> Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
arXiv Detail & Related papers (2026-01-09T07:59:18Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adrialversa Reasoning RAG (ARR)<n>The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage.<n> Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning [44.49803237328707]
ReVSeg executes reasoning as sequential decisions in the native interface of pretrained vision language models.<n>We employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals.
arXiv Detail & Related papers (2025-12-02T14:44:12Z)
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models.<n>We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency.<n>The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations.<n>The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [60.151643048803145]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.<n>Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.<n> Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong derivation in basic video understanding tasks.<n>Video-LLMs struggle with compositional reasoning that requires multi-step explicit-temporal inference across object relations, interactions and events.<n>We propose STEP, a novel graph-guided self-training method that enables VideoLLMs to generate reasoning-rich finetuning data from any raw videos to improve itself.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.