RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
- URL: http://arxiv.org/abs/2508.13229v3
- Date: Mon, 15 Sep 2025 16:19:34 GMT
- Title: RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
- Authors: Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang
- Abstract summary: Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales. Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations.
- Score: 10.797460135169763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.
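The abstract reduces to a concrete loop: sample candidate CoTs, reward each by whether the model can reconstruct the original annotation from it without the label leaking into the CoT text, and keep only high-reward CoTs for the SFT stage. Below is a minimal sketch of that filtering step; the helper callables and the 0/1 reward are illustrative assumptions, not code from the authors' repository.

```python
# Hypothetical sketch of a RISE-style "annotation -> reasoning -> annotation"
# filtering loop; function names and the reward definition are assumptions.
from typing import Callable, List, Tuple

def cot_reward(cot: str, annotation: str,
               reconstruct: Callable[[str], str]) -> float:
    """Score a CoT by whether the model re-derives the annotation from it,
    rejecting CoTs that leak the label verbatim."""
    if annotation.lower() in cot.lower():        # crude leakage check (assumption)
        return 0.0
    return 1.0 if reconstruct(cot) == annotation else 0.0

def build_sft_subset(
    examples: List[Tuple[str, str]],              # (image_id, annotation) pairs
    propose: Callable[[str, str], List[str]],     # VLM sampling of candidate CoTs
    reconstruct: Callable[[str], str],            # VLM mapping CoT -> annotation
    threshold: float = 1.0,
) -> List[Tuple[str, str, str]]:
    """Stage 1 (RISE-CoT): keep only examples whose best CoT clears the
    reward threshold."""
    kept = []
    for image_id, annotation in examples:
        candidates = propose(image_id, annotation)
        scored = [(cot_reward(c, annotation, reconstruct), c) for c in candidates]
        reward, best = max(scored, key=lambda rc: rc[0])
        if reward >= threshold:
            kept.append((image_id, best, annotation))
    return kept
```

Stage 2 (RISE-R1) would then run supervised fine-tuning on the kept triples, followed by reinforcement fine-tuning on the annotation task itself.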
Related papers
- CrystaL: Spontaneous Emergence of Visual Latents in MLLMs [55.34169914483764]
We propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics. Experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines.
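The summary names two alignment targets across the intact and corrupted paths: attention patterns and prediction distributions. A generic sketch of such a two-path consistency objective follows; the exact losses and weighting in CrystaL may differ, this only shows an assumed shape.

```python
# Assumption-level sketch of a two-path consistency loss: align the corrupted
# path's predictions (and attention maps) to the clean path's.
import torch
import torch.nn.functional as F

def two_path_alignment_loss(logits_clean, logits_corrupt,
                            attn_clean, attn_corrupt, beta=1.0):
    # KL from the corrupted path's prediction to the clean (detached) one
    pred_loss = F.kl_div(F.log_softmax(logits_corrupt, dim=-1),
                         F.softmax(logits_clean.detach(), dim=-1),
                         reduction="batchmean")
    # match attention patterns between the two forward passes
    attn_loss = F.mse_loss(attn_corrupt, attn_clean.detach())
    return pred_loss + beta * attn_loss
```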
arXiv Detail & Related papers (2026-02-24T15:01:30Z)
- On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs [15.301640007799735]
We show that simple, controlled textual perturbations, such as misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence. To better understand these vulnerabilities, we analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off.
arXiv Detail & Related papers (2026-02-13T01:12:00Z)
- Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner [46.140724013144194]
Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism.
arXiv Detail & Related papers (2026-02-04T09:00:12Z)
- QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding [80.66379018208568]
Visual quality assessment is shifting from prediction to interpretable quality understanding. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction data. We propose QualiRAG, a framework that systematically leverages the latent perceptual knowledge of large multimodal models for visual quality perception.
arXiv Detail & Related papers (2026-01-26T06:27:03Z)
- CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z)
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with a mechanism to constrain reasoning length while providing dense accuracy-oriented feedback.
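One way to read "constrain reasoning length while providing dense accuracy-oriented feedback" is a composite reward that pays for correctness and softly penalizes tokens past a budget. The weighting and penalty shape below are assumptions for illustration, not ReFine-RFT's actual reward.

```python
# Illustrative composite reward: accuracy term plus a soft length penalty.
def composite_reward(pred: str, label: str, n_reasoning_tokens: int,
                     budget: int = 256, lam: float = 0.1) -> float:
    accuracy = 1.0 if pred.strip() == label.strip() else 0.0
    overrun = max(0, n_reasoning_tokens - budget) / budget  # fraction past budget
    return accuracy - lam * overrun
```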
arXiv Detail & Related papers (2026-01-11T17:07:47Z)
- Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding [10.961445998450008]
Text-to-Image In-Context Learning enables customized image synthesis via interleaved text-image examples. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. We propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms.
arXiv Detail & Related papers (2026-01-07T06:39:45Z)
- A Reasoning Paradigm for Named Entity Recognition [16.86833034216367]
A reasoning framework is proposed for Named Entity Recognition. The framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance.
arXiv Detail & Related papers (2025-11-15T01:31:43Z)
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
Lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence. We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking. Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
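Group Relative Policy Optimization, as commonly described, replaces a learned critic with group statistics: sample several responses per prompt and normalize each scalar reward against the group mean and standard deviation. A minimal sketch of that advantage step (PPO-style ratio clipping and the KL penalty are omitted):

```python
# Sketch of GRPO's group-relative advantage computation.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g., coarse 0/1 quality scores for four sampled evaluations of one image:
# group_relative_advantages([0.0, 1.0, 1.0, 0.0]) ~ [-1.0, 1.0, 1.0, -1.0]
```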
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data. Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning. We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
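A common way to refine the standard image-text contrastive objective with hard negatives is to append perturbed captions (e.g., shuffled word order) to the text side of the in-batch softmax. The sketch below shows that general shape only; the paper's exact negative generation and ranking terms may differ.

```python
# Sketch of a CLIP-style contrastive loss extended with in-batch hard-negative
# captions; embeddings are assumed to come from a trained dual encoder.
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, tau=0.07):
    # img_emb: (B, D), txt_emb: (B, D), hard_txt_emb: (B, D) perturbed captions
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([txt_emb, hard_txt_emb]), dim=-1)
    logits = img @ txt.T / tau            # (B, 2B): true plus hard-negative texts
    targets = torch.arange(img.size(0))   # each image matches its true caption
    return F.cross_entropy(logits, targets)
```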
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)
- Visual Alignment Constraint for Continuous Sign Language Recognition [74.26707067455837]
Vision-based Continuous Sign Language Recognition aims to recognize unsegmented gestures from image sequences.
In this work, we revisit the overfitting problem in recent CTC-based CSLR works and attribute it to the insufficient training of the feature extractor.
We propose a Visual Alignment Constraint (VAC) to enhance the feature extractor with more alignment supervision.
arXiv Detail & Related papers (2021-04-06T07:24:58Z)