Related papers: WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario

WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario

URL: http://arxiv.org/abs/2402.18264v2
Date: Tue, 17 Dec 2024 09:53:41 GMT
Title: WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario
Authors: Jiebin Zhang, Eugene J. Yu, Qinyu Chen, Chenhao Xiong, Dawei Zhu, Han Qian, Mingbo Song, Weimin Xiong, Xiaoguang Li, Qun Liu, Sujian Li,
Abstract summary: WIKIGENBENCH is a new benchmark consisting of 1,320 entries.<n>For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources.<n>For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios.
Score: 32.28150998156827
License: http://creativecommons.org/licenses/by/4.0/
Abstract: It presents significant challenges to generate comprehensive and accurate Wikipedia articles for newly emerging events under a real-world scenario. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.

Related papers

Hierarchical Memory Organization for Wikipedia Generation [41.60777339440196]
This paper introduces the Memory Organization-based Generation (MOG) framework to generate Wikipedia articles autonomously.<n>MOG extracts fine-grained memory units from web documents, organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process.<n> Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles.
arXiv Detail & Related papers (2025-06-29T20:22:49Z)
Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models [0.0]
Large Language Models (LLMs) have shown promise in generating fluent abstractive summaries but they can produce hallucinated details not grounded in the source text. This paper embarks on an exploration of text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset.
arXiv Detail & Related papers (2025-02-28T01:58:17Z)
Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models [24.02950598944251]
We introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents. Our methodology firstly segments a long document into blocks, each of which is embedded using an LLM. We aggregate the query-block relevance scores through a weighted sum method, yielding a comprehensive score for the query with the entire document.
arXiv Detail & Related papers (2025-01-28T16:03:52Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits [92.62157408704594]
HelloFresh is based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages. It mitigates the risk of test data contamination and benchmark overfitting.
arXiv Detail & Related papers (2024-06-05T16:25:57Z)
Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval [21.262531222066208]
In this paper, we devise four pre-training objectives tailored for information retrieval tasks based on the structured knowledge of Wikipedia. Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus. Experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains.
arXiv Detail & Related papers (2023-12-17T09:31:47Z)
Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
Zero-Shot On-the-Fly Event Schema Induction [61.91468909200566]
We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them. Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner.
arXiv Detail & Related papers (2022-10-12T14:37:00Z)
Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers. Previous work has explored ways to partition the search space into hierarchical structures. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects. Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency. We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation. We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context [25.3472693740778]
Embedding based methods are widely used for unsupervised keyphrase extraction (UKE) tasks. In this paper, we propose a novel method for UKE, where local and global contexts are jointly modeled.
arXiv Detail & Related papers (2021-09-15T13:41:10Z)
Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. To build this dataset, we rely on Wikipedia "templates" We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia [14.324743524196874]
We present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. We develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting.
arXiv Detail & Related papers (2021-04-11T14:54:35Z)
Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z)
KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text. We adopt three settings, namely fully-supervised, zero-shot, few-shot to evaluate its effectiveness. Under zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context. We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.