A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
- URL: http://arxiv.org/abs/2510.07958v1
- Date: Thu, 09 Oct 2025 08:53:31 GMT
- Title: A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
- Authors: Fengji Zhang, Xinyao Niu, Chengyang Ying, Guancheng Lin, Zhongkai Hao, Zhou Fan, Chengen Huang, Jacky Keung, Bei Chen, Junyang Lin
- Abstract summary: A$^2$Search is an annotation-free, end-to-end training framework to recognize and handle ambiguity. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance.
- Score: 46.81869577197105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
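The abstract describes the $\mathrm{AnsF1}$ reward only at a high level; a natural reading is a set-level F1 between the model's predicted answers and the reference answer set, which rewards recovering every valid answer without penalizing the model for giving more than one. The Python sketch below illustrates that reading; the function names and the normalization step are illustrative assumptions, not taken from the A$^2$Search codebase.

```python
# Hedged sketch of a set-level AnsF1 reward, one plausible reading of the
# abstract's description. Names (ans_f1, normalize) are illustrative.

def normalize(ans: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(ans.lower().split())

def ans_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 between the predicted and reference answer sets."""
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# An ambiguous question with two valid answers: predicting one of them plus
# one wrong answer gives precision 0.5 and recall 0.5, hence F1 = 0.5.
print(ans_f1(["Paris", "Lyon"], ["paris", "Marseille"]))  # 0.5
```

Under this reading, an RL-trained policy is pushed to enumerate all defensible answers rather than gamble on a single one, which is the behaviour the abstract attributes to the reward.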
Related papers
- RubberDuckBench: A Benchmark for AI Coding Assistants [5.198865387380684]
We present RubberDuckBench: a benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary & open-source) on answering these questions. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best-performing models.
arXiv Detail & Related papers (2026-01-23T05:28:48Z) - Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries [53.99620546358492]
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. Existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions. We present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop queries.
arXiv Detail & Related papers (2025-10-13T21:38:04Z) - SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards [23.02076024811612]
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). In this paper, we introduce SATORI ($\textbf{S}$patially $\textbf{A}$nchored $\textbf{T}$ask $\textbf{O}$ptimization with $\textbf{R}$e$\textbf{I}$nforcement Learning), which decomposes VQA into three verifiable stages. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to $15.7\%$ improvement.
arXiv Detail & Related papers (2025-05-25T11:11:06Z) - Measuring short-form factuality in large language models [50.15055025275888]
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
SimpleQA is adversarially collected against GPT-4 responses.
Each answer in SimpleQA is graded as either correct, incorrect, or not attempted.
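As a rough illustration of how such three-way grades are usually aggregated, the Python sketch below computes overall accuracy together with a "correct given attempted" rate; the grade labels and function names are illustrative, and this is a plausible reading of the benchmark's reporting rather than its exact scoring script.

```python
# Hedged sketch: aggregating SimpleQA-style grades (correct / incorrect /
# not_attempted) into summary statistics. Labels are illustrative.
from collections import Counter

def aggregate(grades: list[str]) -> dict[str, float]:
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "accuracy": counts["correct"] / total if total else 0.0,
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }

grades = ["correct", "incorrect", "not_attempted", "correct"]
print(aggregate(grades))
# {'accuracy': 0.5, 'correct_given_attempted': 0.666...}
```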
arXiv Detail & Related papers (2024-11-07T01:58:42Z) - FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic-programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
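To make the facts-and-predicates idea concrete, here is a minimal hedged sketch in Python: a toy query soft-formalised into Prolog-style atoms, plus a naive overlap-based faithfulness score over reasoning steps. Both the atom format and the scoring rule are illustrative assumptions, not FLARE's actual representation or metric.

```python
# Hedged illustration of "soft-formalising" a query into facts and
# predicates. The Prolog-style strings and the overlap-based faithfulness
# score below are toy assumptions, not FLARE's implementation.
facts = [
    "author(hotpotqa_paper, yang).",
    "published(hotpotqa_paper, 2018).",
]
rules = ["answer(X) :- author(hotpotqa_paper, X)."]

def faithfulness(reasoning_steps: list[str], atoms: list[str]) -> float:
    """Fraction of stated reasoning steps mentioning some generated atom."""
    grounded = sum(
        any(atom.split("(")[0] in step for atom in atoms)
        for step in reasoning_steps
    )
    return grounded / len(reasoning_steps) if reasoning_steps else 0.0

steps = ["Look up the author fact.", "Combine with the publication year."]
print(faithfulness(steps, facts + rules))  # 0.5: only step one is grounded
```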
arXiv Detail & Related papers (2024-10-14T19:39:11Z) - Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought [7.356034193515096]
Our paper unveils a pioneering prompting technique, dubbed \textit{Graph of Thoughts} (GoT).
Our method outperformed GPT-4, achieving accuracy improvements of $89.7\%$, $86\%$, and $56\%$ for each respective task.
When juxtaposed with the state-of-the-art prompting method, \textit{Tree of Thought} (ToT), our approach registered average accuracy boosts of $23\%$, $24\%$, and $15\%$.
arXiv Detail & Related papers (2023-08-16T18:13:27Z) - AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [33.29431287523664]
We present a new benchmark for pinpointing compositional spatio-temporal reasoning.
AGQA contains $192M$ unbalanced question-answer pairs for $9.6K$ videos.
While human evaluators marked $86.02\%$ of our question-answer pairs as correct, the best model achieves only $47.74\%$ accuracy.
arXiv Detail & Related papers (2021-03-30T00:24:01Z) - Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z) - Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
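A minimal Python sketch of the $x \to z \to y$ factorization the summary describes, using a toy numpy encoder and classifier; the layer sizes, weights, and names are illustrative assumptions, not the paper's architecture.

```python
# Toy x -> z -> y pipeline: an encoder maps input x to representation z,
# and a classifier maps z to a label y. Sizes are arbitrary for illustration.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(16, 8))  # encoder: 16-d input x -> 8-d representation z
W_cls = rng.normal(size=(8, 3))   # classifier: z -> logits over 3 labels

def forward(x: np.ndarray) -> int:
    z = np.tanh(x @ W_enc)        # z = f(x)
    logits = z @ W_cls            # y = g(z)
    return int(np.argmax(logits))

print(forward(rng.normal(size=16)))  # predicted label index
```

Roughly speaking, disentangled-representation methods constrain how information is arranged inside $z$; the summary above credits this with better robustness and domain adaptation.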
arXiv Detail & Related papers (2020-09-21T02:48:46Z)