HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions
- URL: http://arxiv.org/abs/2505.15087v2
- Date: Wed, 08 Oct 2025 05:36:53 GMT
- Authors: Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-Hop Question Answering (MHQA) is crucial for evaluating a model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve quality comparable or superior to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced QA models, especially in domains with scarce resources.
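The listing gives no implementation details beyond the abstract, but the bridge-question case is concrete enough to illustrate. Below is a minimal Python sketch of what such a cross-document pipeline could look like; the `llm` and `retrieve` callables, the prompts, and the `BridgeQuestion` structure are hypothetical stand-ins, not HopWeaver's actual components.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class BridgeQuestion:
    question: str       # two-hop question spanning both documents
    answer: str         # answer found in the complementary document
    bridge_entity: str  # entity linking document A to document B

def synthesize_bridge_question(
    doc_a: str,
    corpus: Sequence[str],
    llm: Callable[[str], str],                       # any text-in/text-out LLM client
    retrieve: Callable[[str, Sequence[str]], str],   # entity -> best complementary doc
) -> BridgeQuestion:
    """Hypothetical sketch of a cross-document bridge-question pipeline."""
    # 1. Ask the LLM for an entity in doc A that likely has coverage elsewhere.
    bridge = llm(f"Name one entity central to this passage:\n{doc_a}").strip()
    # 2. Retrieve a complementary document about that entity (not doc A itself).
    doc_b = retrieve(bridge, [d for d in corpus if d != doc_a])
    # 3. Compose a question that needs a fact from doc A to identify the
    #    bridge entity, then a fact from doc B to reach the final answer.
    question = llm(
        "Write one question that can only be answered by combining:\n"
        f"Passage A: {doc_a}\nPassage B: {doc_b}\n"
        f"The question must mention neither '{bridge}' nor the answer."
    ).strip()
    answer = llm(f"Answer using only Passage B:\n{doc_b}\nQ: {question}").strip()
    return BridgeQuestion(question, answer, bridge)
```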
Related papers
- Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages that provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
- Modeling Contextual Passage Utility for Multihop Question Answering [3.8786514101828167]
Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. We propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question.
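One way to make the trace-based supervision concrete is to derive a utility label for each passage from the order in which a reasoning trace uses it. The sketch below is an illustrative labeling scheme under that assumption; the function name and the specific scoring rule are hypothetical, not the paper's.

```python
def utility_labels_from_trace(passage_ids, trace):
    """Assign higher utility to passages referenced earlier in a reasoning trace.

    `trace` is the ordered list of passage ids a reasoning model cited while
    answering; unreferenced passages get utility 0. (Hypothetical scheme.)
    """
    n = len(trace)
    rank = {pid: i for i, pid in enumerate(trace)}
    # Earlier-used passages get labels closer to 1.0; unused passages get 0.0.
    return [(n - rank[pid]) / n if pid in rank else 0.0 for pid in passage_ids]

# Example: passage p2 was used first, then p0; p1 was never used.
print(utility_labels_from_trace(["p0", "p1", "p2"], ["p2", "p0"]))
# -> [0.5, 0.0, 1.0]
```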
arXiv Detail & Related papers (2025-12-06T14:54:47Z)
- OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive [50.468138755368805]
The opioid crisis represents a significant moment in public health, and the data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA) are challenging to analyze at scale. In this paper, we tackle this challenge by organizing the original dataset according to document attributes.
arXiv Detail & Related papers (2025-11-13T03:27:32Z)
- BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data [8.52473384574856]
We present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion.
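A hedged sketch of what NLI-based, diversity-aware cluster growth could look like follows; `nli_relation` and `similarity` are hypothetical callables standing in for an NLI model and a semantic-similarity scorer, and the thresholds are illustrative.

```python
from typing import Callable, Sequence

def grow_evidence_cluster(
    seed: str,
    candidates: Sequence[str],
    nli_relation: Callable[[str, str], str],   # -> "entailment" | "contradiction" | "neutral"
    similarity: Callable[[str, str], float],   # semantic similarity in [0, 1]
    max_size: int = 5,
    diversity_cap: float = 0.8,
) -> list[tuple[str, str]]:
    """Grow a logically labeled evidence cluster bottom-up (hypothetical sketch).

    Each admitted passage is stored with the NLI relation that links it to the
    cluster seed; near-duplicates are rejected to keep the cluster diverse.
    """
    cluster: list[tuple[str, str]] = [(seed, "seed")]
    for passage in candidates:
        if len(cluster) >= max_size:
            break
        relation = nli_relation(seed, passage)
        if relation == "neutral":
            continue  # no usable logical link to the cluster
        # Diversity-aware expansion: skip passages too close to existing members.
        if any(similarity(passage, member) > diversity_cap for member, _ in cluster):
            continue
        cluster.append((passage, relation))
    return cluster
```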
arXiv Detail & Related papers (2025-10-28T07:43:15Z)
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset for evaluating multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. It comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
- iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering [6.4524748618007415]
iQUEST is a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions. We integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step.
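The iterative decomposition loop can be sketched independently of the GNN lookahead component (omitted here for brevity); all helper names below are hypothetical stand-ins, not iQUEST's actual modules.

```python
from typing import Callable

def iterative_kbqa(
    question: str,
    decompose: Callable[[str, list[str]], "str | None"],  # next sub-question or None
    answer_subq: Callable[[str], str],                     # KB lookup for one sub-question
    max_steps: int = 4,
) -> list[tuple[str, str]]:
    """Iteratively decompose a complex query into sub-questions (sketch).

    `decompose` proposes the next sub-question given everything answered so
    far, and returns None once the original question is resolved.
    """
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        facts = [f"{q} -> {a}" for q, a in history]
        sub_q = decompose(question, facts)
        if sub_q is None:
            break  # the accumulated facts already answer the question
        history.append((sub_q, answer_subq(sub_q)))
    return history
```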
arXiv Detail & Related papers (2025-06-02T15:30:02Z)
- Inter-Passage Verification for Multi-evidence Multi-answer QA [22.233409308846067]
We propose a new multi-answer QA framework -- Retrieval-augmented Independent Reading with Inter-passage Verification. Our framework retrieves a large set of passages and processes each passage individually to generate an initial high-recall but noisy answer set. Our framework significantly outperforms existing baselines across various model sizes, achieving an average F1 score improvement of 11.17%.
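The two-stage read-then-verify flow suggests a simple voting structure. The sketch below assumes hypothetical `read` and `supports` models and an illustrative vote threshold; it is not the paper's implementation.

```python
from collections import Counter
from typing import Callable, Sequence

def verified_answers(
    question: str,
    passages: Sequence[str],
    read: Callable[[str, str], "str | None"],   # one passage -> candidate answer
    supports: Callable[[str, str, str], bool],  # does a passage support (q, answer)?
    min_votes: int = 2,
) -> list[str]:
    """Independent reading followed by inter-passage verification (sketch)."""
    # Stage 1: independent reading over every passage (high recall, noisy).
    candidates = {a for p in passages if (a := read(question, p)) is not None}
    # Stage 2: verify each candidate against every passage and count votes.
    votes = Counter(
        a for a in candidates for p in passages if supports(question, a, p)
    )
    # Keep only candidates supported by enough independent passages.
    return [a for a, v in votes.items() if v >= min_votes]
```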
arXiv Detail & Related papers (2025-05-31T07:03:52Z)
- FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering [21.545569307511183]
Multimodal multihop question answering (MMQA) requires reasoning over images and text from multiple sources. Existing methods focus on single-hop, single-modality, or short texts. We introduce FM2DS, the first framework for creating a high-quality dataset for MMQA.
arXiv Detail & Related papers (2024-12-09T22:35:44Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- Explainable Multi-hop Question Generation: An End-to-End Approach without Intermediate Question Labeling [6.635572580071933]
Multi-hop question generation aims to generate complex questions that require multi-step reasoning over several documents.
Previous studies have predominantly utilized end-to-end models, wherein questions are decoded based on the representation of context documents.
This paper introduces an end-to-end question rewriting model that increases question complexity through sequential rewriting.
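Sequential rewriting lends itself to a short sketch: start from a single-hop question and fold in one document per additional hop, for example by replacing a named entity with a clause that describes it. The `rewrite_step` callable below is a hypothetical stand-in for the paper's rewriting model.

```python
from typing import Callable

def rewrite_to_multihop(
    simple_question: str,
    docs: list[str],
    rewrite_step: Callable[[str, str], str],  # (question, extra doc) -> harder question
    hops: int = 2,
) -> str:
    """Increase question complexity by sequential rewriting (sketch)."""
    question = simple_question
    for doc in docs[: hops - 1]:  # one rewrite per additional hop
        # Each step grounds the rewrite in one more supporting document.
        question = rewrite_step(question, doc)
    return question
```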
arXiv Detail & Related papers (2024-03-31T06:03:54Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering [48.25449258017601]
State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases.
We propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement.
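Training-dynamics diagnostics typically track per-example statistics across epochs. The sketch below computes mean confidence and variability in the style of dataset cartography; QADYNAMICS' exact statistics and thresholds may differ, and the flagging rule here is illustrative.

```python
import statistics

def training_dynamics(per_epoch_probs: dict) -> dict:
    """Summarize each QA pair by its training dynamics (sketch).

    `per_epoch_probs[example_id]` holds the model's probability of the gold
    answer at the end of each training epoch. Low-confidence, low-variability
    examples are flagged as likely noise (illustrative rule, not the paper's).
    """
    report = {}
    for ex_id, probs in per_epoch_probs.items():
        confidence = statistics.mean(probs)     # average gold-answer probability
        variability = statistics.pstdev(probs)  # spread across epochs
        report[ex_id] = {
            "confidence": confidence,
            "variability": variability,
            "suspect": confidence < 0.3 and variability < 0.1,
        }
    return report

# Example: a pair the model never learns vs. one it learns steadily.
print(training_dynamics({
    "noisy": [0.10, 0.12, 0.08],
    "clean": [0.30, 0.60, 0.90],
}))
```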
arXiv Detail & Related papers (2023-10-17T14:27:34Z)
- Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present the Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
- Multi-Type Conversational Question-Answer Generation with Closed-ended and Unanswerable Questions [3.6825890616838066]
Conversational question answering (CQA) facilitates an incremental and interactive understanding of a given context.
We introduce a novel method to synthesize data for CQA with various question types, including open-ended, closed-ended, and unanswerable questions.
Across four domains, CQA systems trained on our synthetic data perform close to systems trained on human-annotated data.
arXiv Detail & Related papers (2022-10-24T07:01:51Z)
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
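The probing setup can be sketched as answering the chain of single-hop questions in order and comparing the result with the direct multi-hop answer. The placeholder convention below ("#N" standing for the previous hop's answer) is illustrative, not the paper's exact format.

```python
from typing import Callable

def consistency_check(
    multi_hop_q: str,
    single_hop_chain: list[str],
    qa_model: Callable[[str], str],
) -> dict:
    """Probe a QA model with a multi-hop question and its single-hop chain (sketch).

    The chain is answered in order, substituting each answer into the next
    question's "#N" placeholder; the final chained answer is then compared
    with the model's direct multi-hop answer.
    """
    direct = qa_model(multi_hop_q)
    answer = ""
    for i, q in enumerate(single_hop_chain):
        # Fill in the previous hop's answer, e.g. "Where was #1 born?"
        answer = qa_model(q.replace(f"#{i}", answer) if i else q)
    return {"direct": direct, "chained": answer, "consistent": direct == answer}
```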
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
- From Easy to Hard: Two-stage Selector and Reader for Multi-hop Question Answering [12.072618400000763]
Multi-hop question answering (QA) is a challenging task requiring QA systems to perform complex reasoning over multiple documents.
We propose a novel framework, From Easy to Hard (FE2H), to remove distracting information and obtain better contextual representations.
FE2H divides both the document selector and reader into two stages following an easy-to-hard manner.
arXiv Detail & Related papers (2022-05-24T02:33:58Z)
- Modeling Multi-hop Question Answering as Single Sequence Prediction [88.72621430714985]
We propose a simple generative approach (PathFid) that extends the task beyond just answer generation.
PathFid explicitly models the reasoning process to resolve the answer for multi-hop questions.
Our experiments demonstrate that PathFid leads to strong performance gains on two multi-hop QA datasets.
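PathFid's key move is casting path prediction and answer generation as one sequence. A sketch of such a linearized target follows; the marker tokens are illustrative, not the paper's exact vocabulary.

```python
def linearize_path(titles: list[str], facts: list[str], answer: str) -> str:
    """Flatten a reasoning path and its answer into one target sequence (sketch)."""
    parts = []
    for title, fact in zip(titles, facts):
        # Each hop contributes its passage title and supporting fact.
        parts.append(f"<title> {title} <fact> {fact}")
    parts.append(f"<answer> {answer}")
    return " ".join(parts)

# Example target for a two-hop question:
print(linearize_path(
    ["Opus Records", "Bratislava"],
    ["Opus is a record label based in Bratislava.",
     "Bratislava is the capital of Slovakia."],
    "Slovakia",
))
```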
arXiv Detail & Related papers (2022-05-18T21:57:59Z)
- Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps [31.472490306390977]
A multi-hop question answering dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question.
Previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question.
We present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data.
arXiv Detail & Related papers (2020-11-02T15:42:40Z)
- Multi-hop Question Generation with Graph Convolutional Network [58.31752179830959]
Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered evidence from different paragraphs.
We propose the Multi-Hop Encoding Fusion Network for Question Generation (MulQG), which does context encoding in multiple hops.
Our proposed model is able to generate fluent questions with high completeness and outperforms the strongest baseline by 20.8% in the multi-hop evaluation.
arXiv Detail & Related papers (2020-10-19T06:15:36Z)
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval [117.07047313964773]
We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions.
Our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers.
Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.
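The approach iteratively re-encodes the question together with the passages retrieved so far, which is why no hyperlinks or entity markers are needed. The sketch below assumes a hypothetical shared `encode` function and uses inner-product scoring; it is a simplified stand-in, not the paper's system.

```python
import numpy as np
from typing import Callable

def multi_hop_dense_retrieve(
    question: str,
    corpus: list,
    encode: Callable[[str], np.ndarray],  # shared query/passage encoder (hypothetical)
    hops: int = 2,
) -> list:
    """Iterative dense retrieval for multi-hop questions (sketch)."""
    passage_vecs = np.stack([encode(p) for p in corpus])
    path = []
    for _ in range(hops):
        # Each hop's query is the question plus everything retrieved so far.
        query_vec = encode(" ".join([question, *path]))
        scores = passage_vecs @ query_vec  # maximum inner product search
        # Take the best-scoring passage not already on the path.
        for idx in np.argsort(-scores):
            if corpus[idx] not in path:
                path.append(corpus[idx])
                break
    return path
```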
arXiv Detail & Related papers (2020-09-27T06:12:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.