Document-Level Text Simplification: Dataset, Criteria and Baseline
- URL: http://arxiv.org/abs/2110.05071v1
- Date: Mon, 11 Oct 2021 08:15:31 GMT
- Title: Document-Level Text Simplification: Dataset, Criteria and Baseline
- Authors: Renliang Sun, Hanqi Jin, Xiaojun Wan
- Abstract summary: We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
- Score: 75.58761130635824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text simplification is a valuable technique. However, current research is
limited to sentence simplification. In this paper, we define and investigate a
new task of document-level text simplification, which aims to simplify a
document consisting of multiple sentences. Based on Wikipedia dumps, we first
construct a large-scale dataset named D-Wikipedia and perform analysis and
human evaluation on it to show that the dataset is reliable. Then, we propose a
new automatic evaluation metric called D-SARI that is more suitable for the
document-level simplification task. Finally, we select several representative
models as baseline models for this task and perform automatic evaluation and
human evaluation. We analyze the results and point out the shortcomings of the
baseline models.
Related papers
- fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models [19.099810900404357]
We introduce fPLSA, a foundation-model-based Probabilistic Latent Semantic Analysis (PLSA) method.
fPLSA iteratively clusters and tags document segments based on document-level contexts.
Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fPLSA tags help reconstruct the original texts better than existing tagging methods.
arXiv Detail & Related papers (2024-10-07T20:25:52Z)
- SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages [87.08880616654258]
We introduce the SWiPE dataset, which reconstructs the document-level editing process from English Wikipedia (EW) articles to paired Simple Wikipedia (SEW) articles.
We work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling more than 40,000 edits with 19 proposed categories.
We find that SWiPE-trained models generate more complex edits while reducing unwanted edits.
arXiv Detail & Related papers (2023-05-30T16:52:42Z)
- SASS: Data and Methods for Subject Aware Sentence Simplification [0.0]
This paper provides a dataset aimed at training models that perform subject aware sentence simplifications.
We also test models on that dataset that are inspired by model architectures used in abstractive summarization.
arXiv Detail & Related papers (2023-03-26T00:02:25Z)
- Exploiting Summarization Data to Help Text Simplification [50.0624778757462]
We analyzed the similarity between text summarization and text simplification and exploited summarization data to help simplify.
We named the sentence pairs extracted from summarization data Sum4Simp (S4S) and conducted human evaluations to show that S4S is high-quality.
arXiv Detail & Related papers (2023-02-14T15:32:04Z)
- Document-Level Abstractive Summarization [0.0]
We study how efficient Transformer techniques can be used to improve the automatic summarization of very long texts.
We propose a novel retrieval-enhanced approach which reduces the cost of generating a summary of the entire document by processing smaller chunks.
arXiv Detail & Related papers (2022-12-06T14:39:09Z)
- Value Retrieval with Arbitrary Queries for Form-like Documents [50.5532781148902]
We propose value retrieval with arbitrary queries for form-like documents.
Our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form.
We propose a simple document language modeling (simpleDLM) strategy to improve document understanding on large-scale model pre-training.
arXiv Detail & Related papers (2021-12-15T01:12:02Z)
- Neural CRF Model for Sentence Alignment in Text Simplification [31.62648025127563]
We create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia.
Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1.
A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.
arXiv Detail & Related papers (2020-05-05T16:47:51Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.