RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
- URL: http://arxiv.org/abs/2508.13229v3
- Date: Mon, 15 Sep 2025 16:19:34 GMT
- Title: RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
- Authors: Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang
- Abstract summary: Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales. Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations.
- Score: 10.797460135169763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.
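The abstract reduces to a concrete loop: sample candidate CoTs, reward each by whether the model can reconstruct the original annotation from it without the label leaking into the CoT text, and keep only high-reward CoTs for the SFT stage. Below is a minimal sketch of that filtering step; the helper callables and the 0/1 reward are illustrative assumptions, not code from the authors' repository.

```python
# Hypothetical sketch of a RISE-style "annotation -> reasoning -> annotation"
# filtering loop; function names and the reward definition are assumptions.
from typing import Callable, List, Tuple

def cot_reward(cot: str, annotation: str,
               reconstruct: Callable[[str], str]) -> float:
    """Score a CoT by whether the model re-derives the annotation from it,
    rejecting CoTs that leak the label verbatim."""
    if annotation.lower() in cot.lower():        # crude leakage check (assumption)
        return 0.0
    return 1.0 if reconstruct(cot) == annotation else 0.0

def build_sft_subset(
    examples: List[Tuple[str, str]],              # (image_id, annotation) pairs
    propose: Callable[[str, str], List[str]],     # VLM sampling of candidate CoTs
    reconstruct: Callable[[str], str],            # VLM mapping CoT -> annotation
    threshold: float = 1.0,
) -> List[Tuple[str, str, str]]:
    """Stage 1 (RISE-CoT): keep only examples whose best CoT clears the
    reward threshold."""
    kept = []
    for image_id, annotation in examples:
        candidates = propose(image_id, annotation)
        scored = [(cot_reward(c, annotation, reconstruct), c) for c in candidates]
        reward, best = max(scored, key=lambda rc: rc[0])
        if reward >= threshold:
            kept.append((image_id, best, annotation))
    return kept
```

Stage 2 (RISE-R1) would then run supervised fine-tuning on the kept triples, followed by reinforcement fine-tuning on the annotation task itself.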
Related papers
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs [55.34169914483764]
We propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics. Experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines.
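The summary names two alignment targets across the intact and corrupted paths: attention patterns and prediction distributions. A generic sketch of such a two-path consistency objective follows; the exact losses and weighting in CrystaL may differ, this only shows an assumed shape.

```python
# Assumption-level sketch of a two-path consistency loss: align the corrupted
# path's predictions (and attention maps) to the clean path's.
import torch
import torch.nn.functional as F

def two_path_alignment_loss(logits_clean, logits_corrupt,
                            attn_clean, attn_corrupt, beta=1.0):
    # KL from the corrupted path's prediction to the clean (detached) one
    pred_loss = F.kl_div(F.log_softmax(logits_corrupt, dim=-1),
                         F.softmax(logits_clean.detach(), dim=-1),
                         reduction="batchmean")
    # match attention patterns between the two forward passes
    attn_loss = F.mse_loss(attn_corrupt, attn_clean.detach())
    return pred_loss + beta * attn_loss
```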
arXiv Detail & Related papers (2026-02-24T15:01:30Z)
- On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs [15.301640007799735]
We show that simple, controlled textual perturbations, such as misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence. To better understand these vulnerabilities, we analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off.
arXiv Detail & Related papers (2026-02-13T01:12:00Z)
- Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner [46.140724013144194]
Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism.
arXiv Detail & Related papers (2026-02-04T09:00:12Z)
- QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding [80.66379018208568]
Visual quality assessment is shifting from prediction to interpretable quality understanding. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction data. We propose QualiRAG, a framework that systematically leverages the latent perceptual knowledge of large multimodal models for visual quality perception.
arXiv Detail & Related papers (2026-01-26T06:27:03Z)
- CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z)
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with a mechanism to constrain reasoning length while providing dense accuracy-oriented feedback.
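One way to read "constrain reasoning length while providing dense accuracy-oriented feedback" is a composite reward that pays for correctness and softly penalizes tokens past a budget. The weighting and penalty shape below are assumptions for illustration, not ReFine-RFT's actual reward.

```python
# Illustrative composite reward: accuracy term plus a soft length penalty.
def composite_reward(pred: str, label: str, n_reasoning_tokens: int,
                     budget: int = 256, lam: float = 0.1) -> float:
    accuracy = 1.0 if pred.strip() == label.strip() else 0.0
    overrun = max(0, n_reasoning_tokens - budget) / budget  # fraction past budget
    return accuracy - lam * overrun
```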
arXiv Detail & Related papers (2026-01-11T17:07:47Z)
- Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding [10.961445998450008]
Text-to-Image In-Context Learning enables customized image synthesis via interleaved text-image examples. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. We propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms.
arXiv Detail & Related papers (2026-01-07T06:39:45Z)
- A Reasoning Paradigm for Named Entity Recognition [16.86833034216367]
A reasoning framework is proposed for Named Entity Recognition. The framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance.
arXiv Detail & Related papers (2025-11-15T01:31:43Z)
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
Lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence. We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking. Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
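Group Relative Policy Optimization, as commonly described, replaces a learned critic with group statistics: sample several responses per prompt and normalize each scalar reward against the group mean and standard deviation. A minimal sketch of that advantage step (PPO-style ratio clipping and the KL penalty are omitted):

```python
# Sketch of GRPO's group-relative advantage computation.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g., coarse 0/1 quality scores for four sampled evaluations of one image:
# group_relative_advantages([0.0, 1.0, 1.0, 0.0]) ~ [-1.0, 1.0, 1.0, -1.0]
```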
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data. Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning. We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
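A common way to refine the standard image-text contrastive objective with hard negatives is to append perturbed captions (e.g., shuffled word order) to the text side of the in-batch softmax. The sketch below shows that general shape only; the paper's exact negative generation and ranking terms may differ.

```python
# Sketch of a CLIP-style contrastive loss extended with in-batch hard-negative
# captions; embeddings are assumed to come from a trained dual encoder.
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, tau=0.07):
    # img_emb: (B, D), txt_emb: (B, D), hard_txt_emb: (B, D) perturbed captions
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([txt_emb, hard_txt_emb]), dim=-1)
    logits = img @ txt.T / tau            # (B, 2B): true plus hard-negative texts
    targets = torch.arange(img.size(0))   # each image matches its true caption
    return F.cross_entropy(logits, targets)
```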
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)
- Visual Alignment Constraint for Continuous Sign Language Recognition [74.26707067455837]
Vision-based Continuous Sign Language Recognition aims to recognize unsegmented gestures from image sequences.
In this work, we revisit the overfitting problem in recent CTC-based CSLR works and attribute it to the insufficient training of the feature extractor.
We propose a Visual Alignment Constraint (VAC) to enhance the feature extractor with more alignment supervision.
arXiv Detail & Related papers (2021-04-06T07:24:58Z)