Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
- URL: http://arxiv.org/abs/2506.06313v3
- Date: Sat, 04 Oct 2025 00:28:12 GMT
- Title: Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
- Authors: Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
- Abstract summary: We present a discourse-aware hierarchical framework to enhance long document question answering. The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval.
- Score: 51.7493726399073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document question answering systems typically process texts as flat sequences or use arbitrary segmentation, failing to capture discourse structures that guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) to enhance long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Comprehensive experiments on QASPER, QuALITY, and NarrativeQA demonstrate consistent improvements over existing approaches. Ablation studies confirm that incorporating discourse structure significantly enhances question answering across diverse document types.
Related papers
- MoDora: Tree-Based Semi-Structured Document Analysis System [62.01015188258797]
Semi-structured documents integrate diverse interleaved data elements arranged in various and often irregular layouts. MoDora is an LLM-powered system for semi-structured document analysis. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy.
arXiv Detail & Related papers (2026-02-26T14:48:49Z) - DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search [23.447631421934847]
DeepRead is a structure-aware document reasoning agent designed to operationalize document-native structural priors into actionable reasoning capabilities. DeepRead elicits a human-like "locate-then-read" reasoning paradigm, effectively mitigating the context fragmentation inherent in traditional retrieval methods.
arXiv Detail & Related papers (2026-02-04T20:03:28Z) - Structured Attention Matters to Multimodal LLMs in Document Understanding [52.37530640460363]
We investigate how input format influences document comprehension performance. We discover that raw OCR text often impairs rather than improves MLLMs' performance. We propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm.
arXiv Detail & Related papers (2025-06-19T07:16:18Z) - Align to Structure: Aligning Large Language Models with Structural Information [26.960069076925386]
We introduce Structural Alignment, a novel method that aligns large language models with human-like discourse structures to enhance long-form text generation. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing.
arXiv Detail & Related papers (2025-04-04T17:40:04Z) - Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification [20.434941308959786]
Long document classification presents challenges due to documents' extensive content and complex structure.
Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents.
Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts.
arXiv Detail & Related papers (2024-10-03T19:25:01Z) - SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z) - Unsupervised Mutual Learning of Discourse Parsing and Topic Segmentation in Dialogue [37.618612723025784]
In dialogue systems, discourse plays a crucial role in managing conversational focus and coordinating interactions. It consists of two key structures: rhetorical structure and topic structure. We introduce a unified representation that integrates rhetorical and topic structures, ensuring semantic consistency between them. We propose an unsupervised mutual learning framework (UMLF) that jointly models rhetorical and topic structures, allowing them to mutually reinforce each other without requiring additional annotations.
arXiv Detail & Related papers (2024-05-30T08:10:50Z) - Modeling Unified Semantic Discourse Structure for High-quality Headline Generation [45.23071138765902]
We propose using a unified semantic discourse structure (S3) to represent document semantics.
The hierarchical composition of sentence, clause, and word intrinsically characterizes the semantic meaning of the overall document.
Our work can be instructive for a broad range of document modeling tasks beyond headline generation or summarization.
arXiv Detail & Related papers (2024-03-23T09:18:53Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z) - RST-style Discourse Parsing Guided by Document-level Content Structures [27.28989421841165]
Existing RST parsing pipelines construct rhetorical structures without the knowledge of document-level content structures.
We propose a novel pipeline for RST-DP that incorporates structure-aware news content sentence representations.
arXiv Detail & Related papers (2023-09-08T05:50:27Z) - Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z) - Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark [44.06803331843307]
Paragraph-level topic structure captures the overall context of a document at a higher level.
The lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained research and applications.
We propose a hierarchical paragraph-level topic structure representation with three layers to guide corpus construction.
We employ a two-stage human-machine collaborative annotation method to construct the largest Chinese paragraph-level topic structure corpus.
arXiv Detail & Related papers (2023-05-24T06:43:23Z) - Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study [51.079100495163736]
This paper systematically inspects ChatGPT's performance in two discourse analysis tasks: topic segmentation and discourse parsing.
ChatGPT demonstrates proficiency in identifying topic structures in general-domain conversations yet struggles considerably in specific-domain conversations.
Our deeper investigation indicates that ChatGPT can give more reasonable topic structures than human annotations but only linearly parses the hierarchical rhetorical structures.
arXiv Detail & Related papers (2023-05-15T07:14:41Z) - DiscoPrompt: Path Prediction Prompt Tuning for Implicit Discourse Relation Recognition [27.977742959064916]
We propose a prompt-based path prediction method to utilize the interactive information and intrinsic senses among the hierarchy in IDRR.
This is the first work that injects such structure information into pre-trained language models via prompt tuning.
arXiv Detail & Related papers (2023-05-06T08:16:07Z) - Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion [57.43781399856913]
This work adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis.
We characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained questions.
We develop a first-of-its-kind QUD parser that derives a dependency structure of questions over full documents.
arXiv Detail & Related papers (2022-10-12T03:53:12Z) - Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
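The two-step recipe described here is simple enough to sketch in a few lines. In this hedged sketch, `llm` and `reader` are caller-supplied stand-ins (any text-generation and reading-comprehension functions), not part of the paper's code, and the prompt wording is an assumption.

```python
def generate_then_read(question, llm, reader, n_docs=3):
    """Generate-then-read sketch: ask an LLM for background documents,
    then have a reader answer the question from the generated context."""
    # Step 1: generate contextual documents instead of retrieving them.
    docs = [llm(f"Generate a background document to answer the question: {question}")
            for _ in range(n_docs)]
    # Step 2: read the generated documents to produce the final answer.
    context = "\n\n".join(docs)
    return reader(question, context)
```

Sampling several documents (`n_docs > 1`) mirrors the paper's observation that diverse generated contexts improve coverage over a single generation.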
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
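One way to picture a dual dense/lexical representation is as an interpolated score over the two paradigms. The sketch below is a toy analogue for intuition, not UnifieR's actual model: the term-overlap scorer, the precomputed vectors, and the mixing weight `alpha` are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def lexical_score(query, doc):
    """Lexicon-style score: count of query terms matched in the document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def dense_score(q_vec, d_vec):
    """Dense-vector score: cosine similarity of precomputed embeddings."""
    num = sum(a * b for a, b in zip(q_vec, d_vec))
    den = sqrt(sum(a * a for a in q_vec)) * sqrt(sum(b * b for b in d_vec))
    return num / den if den else 0.0

def unified_score(query, doc, q_vec, d_vec, alpha=0.5):
    """Interpolate the two paradigms; alpha is an assumed mixing weight."""
    return alpha * dense_score(q_vec, d_vec) + (1 - alpha) * lexical_score(query, doc)
```

The point of a unified retriever is that both signals come from one model rather than two separately trained systems; the interpolation above only illustrates why combining them helps.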
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - An End-to-End Document-Level Neural Discourse Parser Exploiting Multi-Granularity Representations [24.986030179701405]
We exploit robust representations derived from multiple levels of granularity across syntax and semantics.
We incorporate such representations in an end-to-end encoder-decoder neural architecture for more resourceful discourse processing.
arXiv Detail & Related papers (2020-12-21T08:01:04Z)