ConvergeWriter: Data-Driven Bottom-Up Article Construction
- URL: http://arxiv.org/abs/2509.12811v1
- Date: Tue, 16 Sep 2025 08:30:52 GMT
- Title: ConvergeWriter: Data-Driven Bottom-Up Article Construction
- Authors: Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi, Shichao Wang, Yifei Lu, Yuantao Han, Feiliang Ren
- Abstract summary: Large Language Models (LLMs) have shown remarkable prowess in text generation. Yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. We propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline.
- Score: 6.782320986360278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
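The bottom-up flow described in the abstract (retrieve until nothing new turns up, cluster what was retrieved, then derive the outline from the clusters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the retriever, the clustering algorithm, or the outline generator, so a term-overlap retriever, greedy leader clustering on Jaccard similarity, and term-frequency headings stand in for them. The names `CORPUS`, `STOPWORDS`, `terms`, `iterative_retrieve`, `cluster`, and `outline` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy knowledge base; the paper's corpus and retriever are not given here.
CORPUS = [
    "Solar panels convert sunlight into electricity.",
    "Solar electricity storage relies on lithium batteries.",
    "Lithium batteries degrade after many charge cycles.",
    "Sourdough bread needs a long fermentation.",
]
STOPWORDS = {"a", "after", "into", "many", "needs", "on"}


def terms(text):
    """Lowercased content words, in order of appearance."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def iterative_retrieve(query, corpus, min_overlap=2):
    """Retrieve round by round until a round adds nothing new."""
    query_terms = set(terms(query))
    retrieved = []
    while True:
        new = [i for i, doc in enumerate(corpus)
               if i not in retrieved
               and len(query_terms & set(terms(doc))) >= min_overlap]
        if not new:
            return retrieved  # the reachable "knowledge boundary"
        retrieved.extend(new)
        for i in new:  # expand the query with the terms just found
            query_terms |= set(terms(corpus[i]))


def cluster(doc_ids, corpus, threshold=0.2):
    """Greedy leader clustering on Jaccard similarity (an assumed stand-in)."""
    clusters = []  # each cluster is a list of doc ids; element 0 is its leader
    for i in doc_ids:
        doc = set(terms(corpus[i]))
        best, best_sim = None, threshold
        for c in clusters:
            leader = set(terms(corpus[c[0]]))
            union = doc | leader
            sim = len(doc & leader) / len(union) if union else 0.0
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def outline(clusters, corpus, top_k=2):
    """One section per cluster, headed by frequent terms and citing its sources."""
    sections = []
    for c in clusters:
        counts = Counter(t for i in c for t in terms(corpus[i]))
        heading = " ".join(t for t, _ in counts.most_common(top_k))
        sections.append({"heading": heading, "sources": list(c)})
    return sections


doc_ids = iterative_retrieve("solar electricity", CORPUS)
for section in outline(cluster(doc_ids, CORPUS), CORPUS):
    print(section)
```

In this toy run the sourdough document is never retrieved, so no outline section can refer to it, and every section lists the document ids it was built from. That mirrors, in miniature, the traceability the abstract claims for the bottom-up ordering; in the paper the outline and text are generated by an LLM rather than by term counts.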
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs).
We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- LLM-Driven Ontology Construction for Enterprise Knowledge Graphs [0.0]
This paper introduces OntoEKG, a pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data.
Our approach decomposes the modelling task into two distinct phases: an extraction module that identifies core classes and properties, and an entailment module that logically organizes these elements into a hierarchy before serialising them into standard RDF.
Addressing the significant lack of comprehensive benchmarks for end-to-end construction, we adopt a new evaluation dataset derived from documents across the Data, Finance, and Logistics sectors.
arXiv Detail & Related papers (2026-02-01T15:13:30Z)
- Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce [61.03081096959132]
We propose a context-aware reasoning-enhanced generative search framework for better understanding the complicated context.
Our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
arXiv Detail & Related papers (2025-10-19T16:46:11Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.
Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations.
Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation [18.99847259801634]
We propose Reinforcement Learning from Augmented Generation (RLAG) to embed domain knowledge into large language models.
Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards.
Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches.
arXiv Detail & Related papers (2025-09-24T14:30:16Z)
- Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking [0.9968037829925942]
This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering.
During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations.
Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
arXiv Detail & Related papers (2025-07-14T05:21:58Z)
- DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM).
We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z)
- Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs [31.457954100196524]
We propose a trustworthy reasoning framework, termed Deliberation over Priors (DP).
DP integrates structural priors into Large Language Models (LLMs) through a combination of supervised fine-tuning and Kahneman-Tversky optimization.
Our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors.
arXiv Detail & Related papers (2025-05-21T07:38:45Z)
- Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation [28.568010424711563]
Retrieval-Augmented Generation (RAG) aims to augment the capabilities of Large Language Models (LLMs).
We introduce a compact, efficient, and pluggable module designed to refine retrieved chunks before using them for generation.
arXiv Detail & Related papers (2025-02-18T16:38:39Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- Confidence-Aware Sub-Structure Beam Search (CABS): Mitigating Hallucination in Structured Data Generation with Large Language Models [6.099774114286838]
Confidence estimation methods on Large Language Models (LLMs) primarily focus on the confidence at the individual token level or the entire output sequence level.
We propose Confidence-Aware sub-structure Beam Search (CABS), a novel decoding method operating at the sub-structure level in structured data generation.
Results show that CABS outperforms traditional token-level beam search for structured data generation by an average of 16.7% recall at 90% precision on the problem of product attribute generation.
arXiv Detail & Related papers (2024-05-30T18:21:05Z)
- Everything is Editable: Extend Knowledge Editing to Unstructured Data in Large Language Models [65.10456412127405]
We propose a novel Unstructured Knowledge Editing method, namely UnKE.
In the layer dimension, we propose non-local block key-value storage to replace local layer key-value storage.
In the token dimension, we replace "term-driven optimization" with "cause-driven optimization", which edits the last token directly while preserving context.
arXiv Detail & Related papers (2024-05-24T08:42:40Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction.
RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
arXiv Detail & Related papers (2022-10-19T16:40:28Z)
- Principled Knowledge Extrapolation with GANs [92.62635018136476]
We study counterfactual synthesis from a new perspective of knowledge extrapolation.
We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem.
Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
arXiv Detail & Related papers (2022-05-21T08:39:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.