SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages
- URL: http://arxiv.org/abs/2305.19204v1
- Date: Tue, 30 May 2023 16:52:42 GMT
- Title: SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages
- Authors: Philippe Laban, Jesse Vig, Wojciech Kryscinski, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
- Abstract summary: We introduce the SWiPE dataset, which reconstructs the document-level editing process from English Wikipedia (EW) articles to paired Simple Wikipedia (SEW) articles.
We work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling more than 40,000 edits with 19 proposed categories.
We find that SWiPE-trained models generate more complex edits while reducing unwanted edits.
- Score: 87.08880616654258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text simplification research has mostly focused on sentence-level
simplification, even though many desirable edits - such as adding relevant
background information or reordering content - may require document-level
context. Prior work has also predominantly framed simplification as a
single-step, input-to-output task, only implicitly modeling the fine-grained,
span-level edits that elucidate the simplification process. To address both
gaps, we introduce the SWiPE dataset, which reconstructs the document-level
editing process from English Wikipedia (EW) articles to paired Simple Wikipedia
(SEW) articles. In contrast to prior work, SWiPE leverages the entire revision
history when pairing pages in order to better identify simplification edits. We
work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling
more than 40,000 edits with 19 proposed categories. To scale our efforts, we
propose several models to automatically label edits, achieving an F-1 score of
up to 70.6, indicating that this is a tractable but challenging NLU task.
Finally, we categorize the edits produced by several simplification models and
find that SWiPE-trained models generate more complex edits while reducing
unwanted edits.
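To make the task concrete, here is a minimal sketch of recovering span-level edits between an EW passage and its SEW counterpart using Python's difflib. SWiPE's actual alignment pipeline and 19-category taxonomy are more involved; the example texts and the edit-record fields below are invented for illustration.
```python
import difflib

def extract_edits(source: str, simplified: str):
    """Recover word-level insert/delete/replace spans between a
    complex (EW) passage and its simplified (SEW) counterpart."""
    src_tokens = source.split()
    simp_tokens = simplified.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=simp_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged spans are not edits
        edits.append({
            "op": op,  # 'insert', 'delete', or 'replace'
            "source_span": " ".join(src_tokens[i1:i2]),
            "target_span": " ".join(simp_tokens[j1:j2]),
        })
    return edits

ew = "The phenomenon was first documented in 1887 by researchers in Vienna."
sew = "Scientists in Vienna first wrote about it in 1887."
for edit in extract_edits(ew, sew):
    print(edit)
```
Each extracted span would then be assigned to one of the categories, which is the classification task the paper's automatic labeling models address.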
Related papers
- DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding [128.92659116774374]
We introduce DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging Large Multimodal Models (LMMs).
It consists of three novel components: (1) Doc2Command, which simultaneously localizes edit regions of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based Command Reformulation prompting, which tailors edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs; and (3) processing of these outputs via Large Multimodal Models such as GPT-4V and Gemini to parse the document layout and execute edits on the grounded regions of interest.
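A toy, text-only sketch of that three-stage flow follows; the real system operates on document images with an LMM, and every name and command format here is invented for illustration.
```python
# Toy stand-in for the localize -> reformulate -> execute pipeline.

def doc2command(document: str, request: str) -> tuple[str, str]:
    """Stage 1: localize a region of interest and normalize the request
    into an edit command (here the RoI is just the first matching line)."""
    target = request.split()[-1]  # naive heuristic: last word of the request
    roi = next(line for line in document.splitlines() if target in line)
    return roi, f"rewrite: {request}"

def reformulate(command: str) -> str:
    """Stage 2: tailor a software-style command into an LMM instruction."""
    return f"Apply the following edit to the given region only. {command}"

def execute(document: str, roi: str, instruction: str) -> str:
    """Stage 3: stand-in for the LMM call (GPT-4V / Gemini in the paper);
    here we just uppercase the region to show where the edit lands."""
    return document.replace(roi, roi.upper())

doc = "Title: Quarterly Report\nBody: revenue grew modestly."
roi, cmd = doc2command(doc, "emphasize the Title")
print(execute(doc, roi, reformulate(cmd)))
```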
arXiv Detail & Related papers (2024-10-21T19:59:04Z) - CoEdIT: Text Editing by Task-Specific Instruction Tuning [18.824571167583432]
CoEdIT is a state-of-the-art text editing system for writing assistance.
It takes instructions from the user specifying the attributes of the desired text, and outputs the edited text.
We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing.
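As an illustration of the instruction-driven editing interface CoEdIT describes, the following sketch queries a released checkpoint through the standard transformers API. The model id "grammarly/coedit-large" and the prompt wording are assumptions to verify against the paper's repository.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed hub id for the released CoEdIT checkpoint (T5-based seq2seq model).
tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = AutoModelForSeq2SeqLM.from_pretrained("grammarly/coedit-large")

# The instruction states the desired attribute; the model returns the edit.
prompt = ("Make this sentence simpler: The committee deliberated at "
          "considerable length before reaching a verdict.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```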
arXiv Detail & Related papers (2023-05-17T00:05:24Z) - Understanding Iterative Revision from Human-Written Text [10.714872525208385]
IteraTeR is the first large-scale, multi-domain, edit-intention annotated corpus of iteratively revised text.
It helps us better understand the text revision process by making vital connections between edit intentions and writing quality.
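A sketch of what an edit-intention annotated revision record could look like; the field names and the example intention labels are illustrative, not IteraTeR's actual schema.
```python
from dataclasses import dataclass

@dataclass
class RevisionEdit:
    before: str          # span in the earlier draft
    after: str           # span in the revised draft
    intention: str       # e.g. 'clarity', 'fluency', 'coherence' (assumed labels)
    revision_depth: int  # which iteration of revision produced the edit

edit = RevisionEdit(before="utilize", after="use",
                    intention="clarity", revision_depth=1)
```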
arXiv Detail & Related papers (2022-03-08T01:47:42Z) - Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
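D-SARI builds on SARI, which scores a simplification by how well its added, kept, and deleted words agree with references. A single-reference, unigram sketch of the SARI recipe (not D-SARI's document-level extension) looks like this:
```python
def sari_sketch(source: str, output: str, reference: str) -> float:
    """Simplified SARI: F1 for additions and keeps, precision for deletions.
    Real SARI uses n-grams up to length 4 and multiple references."""
    src, out, ref = set(source.split()), set(output.split()), set(reference.split())

    add_good = (out - src) & (ref - src)   # correctly added words
    add_p = len(add_good) / max(len(out - src), 1)
    add_r = len(add_good) / max(len(ref - src), 1)
    f_add = 2 * add_p * add_r / max(add_p + add_r, 1e-9)

    keep_good = out & src & ref            # correctly kept words
    keep_p = len(keep_good) / max(len(out & src), 1)
    keep_r = len(keep_good) / max(len(ref & src), 1)
    f_keep = 2 * keep_p * keep_r / max(keep_p + keep_r, 1e-9)

    del_good = (src - out) & (src - ref)   # correctly deleted words
    p_del = len(del_good) / max(len(src - out), 1)

    return (f_add + f_keep + p_del) / 3
```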
arXiv Detail & Related papers (2021-10-11T08:15:31Z) - Learning Structural Edits via Incremental Tree Transformations [102.64394890816178]
We present a generic model for incremental editing of structured data (i.e., "structural edits").
Our editor learns to iteratively generate tree edits (e.g., deleting or adding a subtree) and applies them to the partially edited data.
We evaluate our proposed editor on two source code edit datasets, where results show that, with the proposed edit encoder, our editor significantly improves accuracy over previous approaches.
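A minimal sketch of what applying such tree edits could look like, with an invented Node type and delete/add subtree operations rather than the paper's learned edit encoder:
```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def delete_subtree(parent: Node, index: int) -> None:
    """Remove the child subtree at `index`."""
    del parent.children[index]

def add_subtree(parent: Node, index: int, subtree: Node) -> None:
    """Insert `subtree` as a child at `index`."""
    parent.children.insert(index, subtree)

# Iteratively editing a tiny expression tree: (1 + 2) -> (1 * 3)
root = Node("+", [Node("1"), Node("2")])
delete_subtree(root, 1)
add_subtree(root, 1, Node("3"))
root.label = "*"
```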
arXiv Detail & Related papers (2021-01-28T16:11:32Z) - Text Editing by Command [82.50904226312451]
A prevailing paradigm in neural text generation is one-shot generation, where text is produced in a single step.
We address this limitation with an interactive text generation setting in which the user interacts with the system by issuing commands to edit existing text.
We show that our Interactive Editor, a transformer-based model trained on this dataset, outperforms baselines and obtains positive results in both automatic and human evaluations.
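The setting can be pictured with a toy command applier; the command grammar below is invented, and the paper's Interactive Editor is a learned transformer model, not a regex.
```python
import re

def apply_command(text: str, command: str) -> str:
    """Support one invented command form: replace "<old>" with "<new>"."""
    m = re.fullmatch(r'replace "(.+)" with "(.+)"', command)
    if not m:
        raise ValueError(f"unrecognized command: {command}")
    return text.replace(m.group(1), m.group(2))

draft = "The results was surprising."
draft = apply_command(draft, 'replace "was" with "were"')
print(draft)
```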
arXiv Detail & Related papers (2020-10-24T08:00:30Z) - A Structural Model for Contextual Code Changes [20.185486717922615]
Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet.
Our model achieves a 28% relative gain over state-of-the-art sequential models and 2x higher accuracy than syntactic models that learn to generate the edited code.
arXiv Detail & Related papers (2020-05-27T07:16:19Z) - ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification
Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
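For reference, ASSET is publicly distributed; a sketch of loading it with the Hugging Face datasets library, assuming the hub id "facebook/asset" and the standard field names:
```python
from datasets import load_dataset

# "facebook/asset" and the "simplification" config are assumed hub identifiers.
asset = load_dataset("facebook/asset", "simplification", split="validation")
example = asset[0]
print(example["original"])         # the source sentence
print(example["simplifications"])  # the crowdsourced rewrites
```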
arXiv Detail & Related papers (2020-05-01T16:44:54Z)