Related papers: Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

URL: http://arxiv.org/abs/2603.05256v1
Date: Thu, 05 Mar 2026 15:08:06 GMT
Title: Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Authors: Shan Ning, Longtian Qiu, Xuming He
Abstract summary: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge. We propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results.
Score: 19.69940315540221
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
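The abstract's remark about samples "likely to yield non-zero advantages" can be illustrated with a minimal sketch. It assumes a group-normalized advantage of the kind used in GRPO-style training, which the abstract does not specify; the function names, the binary rewards, and the success-rate thresholds are all hypothetical and are not taken from the Wiki-R1 paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards):
    """A sample carries a learning signal only if its rollouts disagree."""
    return len(set(rewards)) > 1


def select_informative(success_rates, low=0.0, high=1.0):
    """Keep sample ids whose estimated success rate lies strictly between
    the (hypothetical) thresholds, i.e. neither always wrong nor always right."""
    return [sid for sid, p in success_rates.items() if low < p < high]


# All rollouts agree: every advantage is zero, so the update has no signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: advantages are non-zero and sum to zero.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Estimated success rates per sample id (illustrative values only).
kept = select_informative({"a": 0.0, "b": 0.5, "c": 1.0})
```

Under this assumption, a sample that the model always answers correctly or always answers incorrectly contributes nothing to the update, which is one way to read why a curriculum would favor samples of intermediate difficulty.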

Related papers

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered a well-suited method to increase the quality of the question-answering functionality. LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
arXiv Detail & Related papers (2025-07-18T12:28:08Z)
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance [0.9208007322096533]
Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA). Innovations include grounding question-aware caption generation to move beyond generic descriptions and obtain compact yet detailed, context-rich information.
arXiv Detail & Related papers (2025-05-25T23:00:30Z)
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach. This study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that seamlessly integrates fine-grained retrieval with MLLMs.
arXiv Detail & Related papers (2025-02-28T11:25:38Z)
Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering [44.54319663913782]
We propose Retrieval-Augmented MLLMs with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved knowledge for a given image-question pair. It achieves a state-of-the-art (SOTA) performance of 63.92% on OK-VQA.
arXiv Detail & Related papers (2024-09-11T15:11:39Z)
Few-shot Transfer Learning for Knowledge Base Question Answering: Fusing Supervised Models with In-Context Learning [20.80841972133938]
Existing Knowledge Base Question Answering (KBQA) architectures are hungry for annotated data. We introduce the problem of few-shot transfer learning for KBQA, where the target domain offers only a few labeled examples. We propose a novel KBQA architecture called FuSIC-KBQA that performs KB-retrieval using multiple source-trained retrievers.
arXiv Detail & Related papers (2023-11-15T11:56:56Z)
Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. We propose a new retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases [63.43418760818188]
We release a new large-scale, high-quality dataset with 64,331 questions, GrailQA. We propose a novel BERT-based KBQA model. The combination of our dataset and model enables us to thoroughly examine and demonstrate, for the first time, the key role of pre-trained contextual embeddings like BERT in the generalization of KBQA.
arXiv Detail & Related papers (2020-11-16T06:36:26Z)
Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data. We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
Unshuffling Data for Improved Generalization [65.57124325257409]
Generalization beyond the training distribution is a core challenge in machine learning. We show that partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple training environments can guide the learning of models with better out-of-distribution generalization.
arXiv Detail & Related papers (2020-02-27T03:07:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.