DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval
- URL: http://arxiv.org/abs/2506.06313v1
- Date: Mon, 26 May 2025 14:45:12 GMT
- Title: DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval
- Authors: Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
- Abstract summary: DISRetrieval is a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our studies confirm that discourse structure significantly enhances retrieval effectiveness across different document lengths and query types.
- Score: 51.89673002051528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document understanding has become increasingly crucial in natural language processing, with retrieval-based methods emerging as a promising solution to address the context length limitations of large language models (LLMs). However, existing approaches either treat documents as flat sequences or employ arbitrary chunking strategies, failing to capture the inherent discourse structure that guides human comprehension. We present DISRetrieval, a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our approach introduces three key innovations: (1) a discourse-aware document organization framework that utilizes rhetorical structure theory (RST) to create sentence-level hierarchical representations, preserving both semantic relationships and natural document flow; (2) an LLM-enhanced node representation technique that combines discourse structure with adaptive summarization to enrich tree nodes with contextual information; and (3) a hierarchical evidence retrieval mechanism that effectively selects relevant content while maintaining discourse coherence. Through comprehensive experiments on QASPER and QuALITY datasets, DISRetrieval demonstrates substantial improvements over existing methods in both token-level retrieval metrics and downstream question answering tasks. Our ablation studies confirm that incorporating discourse structure significantly enhances retrieval effectiveness across different document lengths and query types, validating the importance of linguistically-informed document representation in long-text understanding. Our code and datasets are publicly available at github/DreamH1gh/DISRetrieval to facilitate future research.
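The hierarchical evidence retrieval mechanism described in the abstract can be pictured as a top-down search over a discourse tree whose internal nodes carry summaries and whose leaves are sentences. The sketch below is purely illustrative and is not the authors' implementation: toy bag-of-words vectors stand in for the paper's LLM-enhanced node representations, and the `Node`, `retrieve`, `beam`, and `k` names are hypothetical.

```python
from dataclasses import dataclass, field
from collections import Counter
import math

@dataclass
class Node:
    text: str                                   # sentence (leaf) or summary (internal node)
    children: list = field(default_factory=list)

def embed(text):
    """Toy bag-of-words embedding; a real system would use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(root, query, beam=2, k=3):
    """Top-down beam search over the discourse tree: at each level keep the
    `beam` highest-scoring nodes, then rank the leaves that were reached."""
    q = embed(query)
    frontier, leaves = [root], []
    while frontier:
        nxt = []
        for node in frontier:
            if node.children:
                nxt.extend(node.children)   # expand internal nodes
            else:
                leaves.append(node)         # collect reached sentences
        nxt.sort(key=lambda n: cosine(embed(n.text), q), reverse=True)
        frontier = nxt[:beam]               # prune low-scoring branches
    leaves.sort(key=lambda n: cosine(embed(n.text), q), reverse=True)
    return [n.text for n in leaves[:k]]

# Toy discourse tree: internal nodes hold summaries, leaves hold sentences.
tree = Node("document on discourse-aware retrieval", [
    Node("discourse parsing section", [
        Node("RST builds a tree of rhetorical relations over sentences."),
        Node("Elementary discourse units are clause-level text spans."),
    ]),
    Node("retrieval methods section", [
        Node("Dense retrievers embed queries and passages into one space."),
        Node("BM25 scores documents by lexical overlap with the query."),
    ]),
])
print(retrieve(tree, "dense retrieval of passages", beam=1, k=1))
```

Because internal-node summaries are scored before their descendants, whole irrelevant subtrees are pruned early, which is what keeps the traversal cheap relative to scoring every sentence in a long document.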
Related papers
- Structured Attention Matters to Multimodal LLMs in Document Understanding [52.37530640460363]
We investigate how input format influences document comprehension performance.
We discover that raw OCR text often impairs rather than improves MLLMs' performance.
We propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm.
arXiv Detail & Related papers (2025-06-19T07:16:18Z)
- Align to Structure: Aligning Large Language Models with Structural Information [26.960069076925386]
We introduce Structural Alignment, a novel method that aligns large language models with human-like discourse structures to enhance long-form text generation.
We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing.
arXiv Detail & Related papers (2025-04-04T17:40:04Z)
- Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification [20.434941308959786]
Long document classification is challenging due to documents' extensive content and complex structure.
Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents.
Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts.
arXiv Detail & Related papers (2024-10-03T19:25:01Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- Modeling Unified Semantic Discourse Structure for High-quality Headline Generation [45.23071138765902]
We propose using a unified semantic discourse structure (S3) to represent document semantics.
The hierarchical composition of sentence, clause, and word intrinsically characterizes the semantic meaning of the overall document.
Our work can be instructive for a broad range of document modeling tasks beyond headline or summary generation.
arXiv Detail & Related papers (2024-03-23T09:18:53Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- RST-style Discourse Parsing Guided by Document-level Content Structures [27.28989421841165]
Existing RST parsing pipelines construct rhetorical structures without the knowledge of document-level content structures.
We propose a novel pipeline for RST-DP that incorporates structure-aware news content sentence representations.
arXiv Detail & Related papers (2023-09-08T05:50:27Z)
- Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark [44.06803331843307]
Paragraph-level topic structure helps grasp and understand the overall context of a document from a higher level.
However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained research and applications.
We propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction.
We employ a two-stage human-machine collaborative annotation method to construct the largest Chinese paragraph-level topic structure corpus.
arXiv Detail & Related papers (2023-05-24T06:43:23Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- An End-to-End Document-Level Neural Discourse Parser Exploiting Multi-Granularity Representations [24.986030179701405]
We exploit robust representations derived from multiple levels of granularity across syntax and semantics.
We incorporate such representations in an end-to-end encoder-decoder neural architecture for more resourceful discourse processing.
arXiv Detail & Related papers (2020-12-21T08:01:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.