Context-Picker: Dynamic context selection using multi-stage reinforcement learning
- URL: http://arxiv.org/abs/2512.14465v1
- Date: Tue, 16 Dec 2025 14:52:11 GMT
- Title: Context-Picker: Dynamic context selection using multi-stage reinforcement learning
- Authors: Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu
- Abstract summary: We introduce \emph{Context-Picker}, a reasoning-aware framework for long-context question answering. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines.
- Score: 4.539896456749749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In long-context question answering (LCQA), determining the optimal amount of context for a given query is a significant challenge. Including too few passages may omit critical information, while including too many can introduce noise and reduce the quality of the answer. Traditional approaches, such as fixed Top-$K$ retrieval and single-stage reranking, face the dilemma of selecting the right number of passages. This problem is particularly pronounced for factoid questions, which often require only a few specific pieces of evidence. To address this issue, we introduce \emph{Context-Picker}, a reasoning-aware framework that shifts the paradigm from similarity-based ranking to minimal sufficient subset selection. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule: a \emph{recall-oriented} stage that prioritizes the coverage of reasoning chains, followed by a \emph{precision-oriented} stage that aggressively prunes redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines, achieving superior answer accuracy with comparable or reduced context lengths. Ablation studies indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale-guided format all contribute substantially to these gains.
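The Leave-One-Out mining of "minimal sufficient sets" described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `answers_correctly` stands in for a hypothetical oracle (e.g., an LLM judge) that checks whether the QA model still answers the query correctly from a candidate passage set.

```python
def mine_minimal_sufficient_set(passages, answers_correctly):
    """Greedy Leave-One-Out pruning: repeatedly drop any passage whose
    removal still lets the QA model answer the query correctly."""
    evidence = list(passages)
    changed = True
    while changed:
        changed = False
        for i in range(len(evidence)):
            candidate = evidence[:i] + evidence[i + 1:]
            if answers_correctly(candidate):
                evidence = candidate   # the dropped passage was redundant
                changed = True
                break                  # restart the scan on the smaller set
    return evidence

# Toy oracle: the answer is recoverable iff passages "a" and "b" are present.
needed = {"a", "b"}
minimal = mine_minimal_sufficient_set(["a", "b", "c", "d"],
                                      lambda ps: needed.issubset(ps))
```

Sets retained this way provide the dense, task-aligned supervision signal the paper uses to counter reward sparsity; the greedy order-dependent scan here is only one simple way to realize an LOO procedure.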
Related papers
- More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering [53.09478307383865]
We introduce BiasPrompting, a novel inference framework for large language models (LLMs). It guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It demonstrates significant improvements in five widely used multiple-choice question answering benchmarks.
arXiv Detail & Related papers (2025-11-25T09:01:08Z) - Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models [64.49342399229529]
We argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
arXiv Detail & Related papers (2025-10-29T17:58:59Z) - Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering [21.077964610022313]
This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations.
arXiv Detail & Related papers (2025-06-21T11:55:27Z) - Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z) - QA-prompting: Improving Summarization with Large Language Models using Question-Answering [0.8460698440162888]
Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning. We propose QA-prompting, a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation. Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining.
arXiv Detail & Related papers (2025-05-20T13:29:36Z) - SAGE: A Framework of Precise Retrieval for RAG [9.889395372896153]
Retrieval-augmented generation (RAG) has demonstrated significant proficiency in conducting question-answering tasks. RAG methods segment the corpus without considering semantics, making it difficult to find relevant context. We introduce a RAG framework (SAGE) to overcome these limitations.
arXiv Detail & Related papers (2025-03-03T16:25:58Z) - Options-Aware Dense Retrieval for Multiple-Choice Question Answering [5.098112872671412]
Long-context multiple-choice question answering tasks require robust reasoning over extensive text sources. Prior research in this domain has predominantly utilized pre-trained dense retrieval models. This paper proposes a novel method called Options Aware Dense Retrieval (OADR) to address these challenges.
arXiv Detail & Related papers (2025-01-27T15:03:26Z) - QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [75.81394991657545]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z) - Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency.
This approach fails in scenarios where the correct answers are in the minority.
We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
arXiv Detail & Related papers (2024-05-21T17:12:19Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs).
We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers.
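The confidence score above (the statistical frequency of generated answers) can be illustrated with a minimal sketch; the answer sampling is assumed to have happened elsewhere, and the 0.5 threshold used to split the two subsets is an arbitrary choice for illustration, not the paper's setting.

```python
from collections import Counter

def confidence_score(sampled_answers):
    """Confidence as the relative frequency of the most common sampled answer."""
    top_answer, top_count = Counter(sampled_answers).most_common(1)[0]
    return top_answer, top_count / len(sampled_answers)

# A question whose sampled answers mostly agree gets a high confidence score.
answer, cs = confidence_score(["B", "B", "C", "B"])
subset = "high-confidence" if cs >= 0.5 else "low-confidence"
```

Questions falling into each subset can then be routed to different reasoning strategies, which is the divide step of the framework.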
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
arXiv Detail & Related papers (2022-10-03T05:33:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.