R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation
- URL: http://arxiv.org/abs/2602.00104v1
- Date: Sun, 25 Jan 2026 12:12:12 GMT
- Title: R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation
- Authors: Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang
- Abstract summary: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. We propose R3G, a modular Reasoning-Retrieval-Reranking framework. It produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images.
- Score: 24.755888254171342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remain challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
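The abstract's plan-retrieve-rerank pipeline can be sketched as follows. This is a toy, self-contained illustration only: the "images" are stand-in tag sets, and the cue extraction, similarity scoring, and sufficiency-aware reranking are simplified placeholders, not the authors' implementation or API.

```python
# Toy sketch of the Reasoning-Retrieval-Reranking pipeline described
# in the abstract. All names and logic here are illustrative stand-ins.

def plan_cues(question):
    """Step 1: a 'reasoning plan' reduced to the cue words in the question."""
    stopwords = {"what", "does", "the", "have", "on", "its", "a", "is"}
    return {w.strip("?").lower() for w in question.split()} - stopwords

def coarse_retrieve(cues, corpus, top_k):
    """Step 2a: coarse retrieval -- score every image by cue overlap, keep top-k."""
    scored = sorted(corpus, key=lambda img: len(cues & img["tags"]), reverse=True)
    return scored[:top_k]

def rerank(cues, candidates, top_m):
    """Step 2b: sufficiency-aware reranking -- prefer images covering
    the largest fraction of the required cues."""
    coverage = lambda img: len(cues & img["tags"]) / max(len(cues), 1)
    return sorted(candidates, key=coverage, reverse=True)[:top_m]

def r3g_select(question, corpus, top_k=4, top_m=2):
    cues = plan_cues(question)
    return rerank(cues, coarse_retrieve(cues, corpus, top_k), top_m)

corpus = [
    {"id": "img1", "tags": {"okapi", "savanna"}},
    {"id": "img2", "tags": {"okapi", "stripes", "legs"}},
    {"id": "img3", "tags": {"zebra", "stripes"}},
]
evidence = r3g_select("What stripes does the okapi have on its legs?", corpus)
print([img["id"] for img in evidence])  # the image covering all cues ranks first
```

In the actual system the selected evidence images would then be placed in the MLLM's context for answer generation; here the sketch stops at evidence selection.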
Related papers
- VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning [49.610569478718226]
Multimodal reward models (RMs) have substantially improved post-training for visual generative models. VideoReward Thinker (VR-Thinker) is a thinking-with-image framework that equips the RM with visual reasoning operations and a visual memory window. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks.
arXiv Detail & Related papers (2025-10-12T09:29:50Z)
- VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation [64.82775032985485]
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image inputs to address this issue.
arXiv Detail & Related papers (2025-10-10T13:34:23Z)
- ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models [11.263321053154364]
ERGO is a reasoning-driven perception approach that leverages multimodal context to determine where to focus. We develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency.
arXiv Detail & Related papers (2025-09-26T07:15:19Z)
- From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs [13.410543801811992]
This paper analyzes existing RAG reasoning models and identifies three main failure patterns. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks.
arXiv Detail & Related papers (2025-07-30T14:29:44Z)
- SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification [74.36139886192495]
We propose a novel generative framework named SD-ReID for AG-ReID. We first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions.
arXiv Detail & Related papers (2025-04-13T12:44:50Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval [28.018754406453937]
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image. We present One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR). OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks.
arXiv Detail & Related papers (2024-12-15T06:22:20Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image).
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with four approaches in a bid to improve performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.