DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation
- URL: http://arxiv.org/abs/2503.07170v1
- Date: Mon, 10 Mar 2025 10:48:00 GMT
- Title: DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation
- Authors: Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng,
- Abstract summary: We introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation.<n>DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations.<n>The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity.
- Score: 24.091769825963173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi-agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retreiver, Q&A Annotator and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine-tuned the Qwen2-7b-Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset publicly available to facilitate future research.
Related papers
- RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery [69.41989381702858]
Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency.<n>We propose RAPID, an efficient retrieval-augmented long text generation framework.<n>Our work provides a robust and efficient solution to the challenges of automated long-text generation.
arXiv Detail & Related papers (2025-03-02T06:11:29Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.<n>We introduce novel methodologies and datasets to overcome these challenges.<n>We propose MhBART, an encoder-decoder model designed to emulate human writing style.<n>We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation [26.4086456393314]
Long-form text generation requires coherent, comprehensive responses that address complex queries with both breadth and depth.
Existing iterative retrieval-augmented generation approaches often struggle to delve deeply into each facet of complex queries.
This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval approach.
arXiv Detail & Related papers (2024-10-20T21:17:05Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments demonstrate on two datasets from different domains, that LLMs fine-tuned with the auxiliary task generate higher quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long
Document Summarization [17.58231642569116]
We present HIBRIDS, which injects Hierarchical Biases foR incorporating Document Structure into the calculation of attention scores.
We also present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries.
arXiv Detail & Related papers (2022-03-21T05:27:35Z) - BookSum: A Collection of Datasets for Long-form Narrative Summarization [42.26628743419607]
BookSum is a collection of datasets for long-form narrative summarization.
Our dataset covers source documents from the literature domain, such as novels, plays and stories.
arXiv Detail & Related papers (2021-05-18T00:22:46Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.