From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition
- URL: http://arxiv.org/abs/2512.14244v2
- Date: Thu, 18 Dec 2025 09:35:25 GMT
- Title: From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition
- Authors: Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun
- Abstract summary: We introduce a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. Our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
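To make the structure-then-select idea concrete, here is a minimal Python sketch. All names (`EDUNode`, `select_subtrees`, the toy relevance scores) are illustrative assumptions, not the paper's released code; it assumes EDU segmentation and query-relevance scoring have already happened, and only shows how index-anchored subtrees can be selected under a budget and linearized in source order.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of structure-then-select: EDUs are anchored to source
# indices, organized into a tree, then query-relevant subtrees are selected
# and linearized back in source order. Only top-level subtrees are ranked,
# for brevity.

@dataclass
class EDUNode:
    start: int                      # start index in the source text (anchoring)
    end: int                        # end index in the source text
    children: list = field(default_factory=list)

def linearize(node: EDUNode, source: str) -> str:
    """Emit the node's anchored span; a span subsumes its children."""
    return source[node.start:node.end]

def select_subtrees(root: EDUNode, scores: dict, budget: int) -> list:
    """Greedily keep the highest-scoring subtrees within a character budget.

    `scores` maps id(node) -> query relevance (assumed given by a ranker).
    """
    candidates = sorted(root.children, key=lambda n: scores.get(id(n), 0.0),
                        reverse=True)
    kept, used = [], 0
    for node in candidates:
        length = node.end - node.start
        if used + length <= budget:
            kept.append(node)
            used += length
    # Re-order by source position so the compressed context stays coherent.
    return sorted(kept, key=lambda n: n.start)

source = ("LLMs struggle with long context. Compression helps. "
          "Explicit methods keep text readable. Implicit ones encode latents.")
root = EDUNode(0, len(source), children=[
    EDUNode(0, 32), EDUNode(33, 51), EDUNode(52, 88), EDUNode(89, len(source)),
])
scores = {id(root.children[0]): 0.9, id(root.children[2]): 0.8}
compressed = " ".join(linearize(n, source) for n in
                      select_subtrees(root, scores, budget=80))
print(compressed)
```

Because every node stores only source indices, the compressed output is copied verbatim from the input; that is what anchoring "strictly to source indices" buys: the compressor cannot emit text it never saw.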
Related papers
- Stan: An LLM-based thermodynamics course assistant
Stan is a suite of tools for an undergraduate chemical engineering thermodynamics course, built on a data pipeline that the authors develop and deploy in dual roles. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms (a sketch of this flow follows the entry). On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries.
arXiv Detail & Related papers (2026-03-04T22:44:50Z)
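A minimal sketch of the student-facing flow described in the Stan entry above: extract technical terms from a query, then retrieve matching transcript chunks. The term extractor and bag-of-words retriever are toy stand-ins, not Stan's actual pipeline.

```python
import re
from collections import Counter

TRANSCRIPTS = [
    "Entropy measures disorder; the second law says entropy never decreases.",
    "Fugacity generalizes pressure for non-ideal gas equilibrium calculations.",
]

def extract_terms(query: str) -> list:
    """Crude technical-term extraction: keep longer, non-stopword tokens."""
    stop = {"what", "does", "the", "mean", "is", "a", "how"}
    return [t for t in re.findall(r"[a-z]+", query.lower())
            if len(t) > 3 and t not in stop]

def retrieve(terms: list, corpus: list, k: int = 1) -> list:
    """Rank transcript chunks by term overlap (bag-of-words stand-in)."""
    def score(doc):
        words = Counter(re.findall(r"[a-z]+", doc.lower()))
        return sum(words[t] for t in terms)
    return sorted(corpus, key=score, reverse=True)[:k]

query = "What does fugacity mean?"
context = retrieve(extract_terms(query), TRANSCRIPTS)
print(context)  # chunk(s) that would be passed to the LLM as grounding
```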
- Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems
Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG). This paper proposes a structure-informed and diversity-constrained context bubble construction framework (a selection sketch follows the entry).
arXiv Detail & Related papers (2026-01-15T18:43:19Z)
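A hedged sketch of diversity-constrained selection in the spirit of the context-bubble entry above, using maximal marginal relevance (MMR) as a stand-in for the paper's construction; the relevance and similarity functions here are toys.

```python
# Pick chunks that are relevant to the query but not redundant with each
# other, trading the two off with a weight `lam`.

def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedy MMR: relevance minus worst-case redundancy with picks so far."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

chunks = ["intro to policy", "policy details", "unrelated FAQ", "policy edge cases"]
rel = {"intro to policy": 0.8, "policy details": 0.9,
       "unrelated FAQ": 0.1, "policy edge cases": 0.85}.get
sim = lambda a, b: len(set(a.split()) & set(b.split())) / 4.0  # crude overlap
print(mmr_select(chunks, rel, sim, k=2))
```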
- Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLM reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily (illustrated by the sketch after this entry).
arXiv Detail & Related papers (2026-01-13T03:35:18Z)
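The homophily that the HS2C entry above exploits can be illustrated in a few lines. This computes edge homophily, the fraction of edges joining same-label nodes, on a toy text-attributed graph; it is not the HS2C method itself.

```python
# Edge homophily: how often an edge connects nodes with the same label.
# High homophily means neighborhoods are label-consistent, which is what
# makes structure-aware compression of TAGs informative.

def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "ml", 1: "ml", 2: "db", 3: "ml"}
edges = [(0, 1), (1, 3), (2, 3)]
print(edge_homophily(edges, labels))  # 2/3: mostly homophilous
```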
- RePo: Language Models with Context Re-Positioning
In-context learning is fundamental to modern Large Language Models (LLMs), yet prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. We propose RePo, a novel mechanism that reduces extraneous load via context re-positioning (see the schematic after this entry).
arXiv Detail & Related papers (2025-12-16T13:30:30Z)
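A schematic of context re-positioning as described in the RePo entry above: instead of the default linear indices, position ids are reassigned so that spans judged relevant sit at nearby positions. This is an illustrative assumption about the idea, not RePo's actual mechanism.

```python
# Toy re-positioning: relevant spans get the smallest position ids, the rest
# follow in original order. A real system would feed these ids to the
# model's positional encoding instead of 0..n-1.

def reposition(n_tokens, relevant_spans):
    """Assign small position ids to relevant spans, larger ones to the rest."""
    pos, next_id = [None] * n_tokens, 0
    for start, end in relevant_spans:            # relevant tokens first
        for i in range(start, end):
            pos[i], next_id = next_id, next_id + 1
    for i in range(n_tokens):                    # then everything else
        if pos[i] is None:
            pos[i], next_id = next_id, next_id + 1
    return pos

print(reposition(8, [(4, 6)]))  # [2, 3, 4, 5, 0, 1, 6, 7]
```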
- AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees
We propose AdmTree, a novel framework for adaptive, hierarchical context compression. AdmTree segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree (a schematic follows this entry). By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
arXiv Detail & Related papers (2025-12-04T08:04:19Z)
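A schematic of the semantic binary tree from the AdmTree entry above: leaves hold variable-length segments, internal nodes hold gist summaries. Real gist tokens are learned representations; a string truncation stands in here, so this shows only the tree's shape.

```python
# Build a binary tree bottom-up over text segments; each internal node
# carries a "gist" of its subtree. Truncation is a stand-in for learned
# gist tokens.

def gist(text, width=30):
    """Stand-in for a learned gist summary: keep a short prefix."""
    return text[:width] + ("…" if len(text) > width else "")

def build_tree(segments):
    """Pair up nodes bottom-up into a binary tree of (summary, children)."""
    nodes = [(seg, None) for seg in segments]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes) - 1, 2):
            left, right = nodes[i], nodes[i + 1]
            paired.append((gist(left[0] + " " + right[0]), [left, right]))
        if len(nodes) % 2:                    # carry an odd node upward
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]

root = build_tree(["dense methods section", "sparse related work",
                   "key theorem and proof", "appendix details"])
print(root[0])  # gist at the root summarizes the whole context
```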
- Completion by Comprehension: Guiding Code Generation with Multi-Granularity Understanding
CoCo is a novel framework that enables code Completion by Comprehension of multi-granularity context from large-scale code repositories. Experiments on the CrossCodeEval and RepoEval benchmarks demonstrate that CoCo consistently surpasses state-of-the-art baselines.
arXiv Detail & Related papers (2025-12-04T07:37:59Z)
- Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning
We propose Structure-R1, a framework that transforms retrieved content into structured representations optimized for reasoning. We show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model. Our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity.
arXiv Detail & Related papers (2025-10-16T23:19:28Z)
- Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings. We investigate two primary in-process methods: sequential concatenation and parallel caching (the concatenation variant is sketched after this entry). Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors.
arXiv Detail & Related papers (2025-10-09T19:45:54Z)
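The sequential-concatenation strategy from the Struc-EMB entry above, sketched with a placeholder encoder: neighbor texts are concatenated with the node's own text and encoded in a single pass. `embed` here is a character-histogram stand-in, not a real sentence encoder.

```python
# Structure-aware embedding by sequential concatenation: fold a node's
# graph neighborhood into the text before one encoder call.

def embed(text: str) -> list:
    """Placeholder embedding: character histogram (stand-in for an encoder)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def struct_aware_embed(node_text, neighbor_texts, max_neighbors=2):
    """Sequential concatenation: prepend structural context, then encode once."""
    context = " [SEP] ".join(neighbor_texts[:max_neighbors])
    return embed(context + " [SEP] " + node_text if context else node_text)

vec = struct_aware_embed("graph neural networks survey",
                         ["message passing overview", "GNN benchmarks"])
print(len(vec))  # one fixed-size vector carrying neighbor context
```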
- Data Dependency-Aware Code Generation from Enhanced UML Sequence Diagrams
We propose a novel step-by-step code generation framework named API2Dep. First, we introduce an enhanced Unified Modeling Language (UML) API diagram tailored for service-oriented architectures. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference task.
arXiv Detail & Related papers (2025-08-05T12:28:23Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic
We introduce a consistent and theoretically grounded approach to annotating decompositional entailment.
We find that our new dataset, RDTE, has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets.
We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality (a conceptual sketch follows this entry).
arXiv Detail & Related papers (2024-02-22T18:55:17Z)
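A conceptual sketch of the entailment-tree engine from the entry above: a conclusion is accepted only if every child is proved and an entailment classifier accepts the step from the children to the conclusion. The `entails` function is a toy word-overlap rule standing in for the trained RDTE-oriented classifier.

```python
# Recursive entailment-tree checking: leaves are given facts; an internal
# node holds iff its children hold and the classifier accepts the step.

def entails(premises: list, conclusion: str) -> bool:
    """Toy classifier: accept if every word of the conclusion appears
    somewhere in the premises (stand-in for a trained entailment model)."""
    support = set(" ".join(premises).lower().split())
    return all(w in support for w in conclusion.lower().split())

def prove(node) -> bool:
    """node = (conclusion, children); leaves are accepted as given facts."""
    conclusion, children = node
    if not children:
        return True
    if not all(prove(c) for c in children):
        return False
    return entails([c[0] for c in children], conclusion)

tree = ("the nail conducts electricity",
        [("the nail is a metal", [("the nail is made of iron", []),
                                  ("iron is a metal", [])]),
         ("a metal conducts electricity", [])])
print(prove(tree))  # True: every step is accepted by the classifier
```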