Neural CRF Model for Sentence Alignment in Text Simplification
- URL: http://arxiv.org/abs/2005.02324v4
- Date: Mon, 30 Aug 2021 18:15:31 GMT
- Title: Neural CRF Model for Sentence Alignment in Text Simplification
- Authors: Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu
- Abstract summary: We create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia.
Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1.
A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of a text simplification system heavily depends on the quality
and quantity of complex-simple sentence pairs in the training corpus, which are
extracted by aligning sentences between parallel articles. To evaluate and
improve sentence alignment quality, we create two manually annotated
sentence-aligned datasets from two commonly used text simplification corpora,
Newsela and Wikipedia. We propose a novel neural CRF alignment model which not
only leverages the sequential nature of sentences in parallel documents but
also utilizes a neural sentence pair model to capture semantic similarity.
Experiments demonstrate that our proposed approach outperforms all previous
work on the monolingual sentence alignment task by more than 5 points in F1. We
apply our CRF aligner to construct two new text simplification datasets,
Newsela-Auto and Wiki-Auto, which are much larger and of better quality
compared to the existing datasets. A Transformer-based seq2seq model trained on
our datasets establishes a new state-of-the-art for text simplification in both
automatic and human evaluation.
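The abstract describes the aligner only at a high level. As a rough, hedged illustration (not the authors' released code), the sketch below shows the general shape of such a model: an emission score from a sentence-pair similarity function plus a transition score that rewards in-order alignments, decoded with Viterbi. Here `similarity` is a toy lexical-overlap stand-in for the paper's neural sentence-pair model, and `null_score` and `jump_penalty` are invented constants.

```python
# Minimal sketch of CRF-style monolingual sentence alignment (illustrative
# only; not the authors' implementation).
import math
from collections import Counter

def similarity(simple_sent: str, complex_sent: str) -> float:
    """Toy emission score: cosine-style overlap of word counts.
    Stand-in for the paper's neural sentence-pair model."""
    a = Counter(simple_sent.lower().split())
    b = Counter(complex_sent.lower().split())
    overlap = sum((a & b).values())
    denom = math.sqrt(sum(a.values()) * sum(b.values())) or 1.0
    return overlap / denom

def align(simple_doc, complex_doc, null_score=0.1, jump_penalty=0.2):
    """Viterbi decoding: label each simple sentence with the index of an
    aligned complex sentence, or -1 for 'not aligned'. The transition
    score exploits sentence order by rewarding monotone alignments."""
    labels = list(range(len(complex_doc))) + [-1]  # -1 = NULL alignment
    emit = [[similarity(s, complex_doc[j]) if j >= 0 else null_score
             for j in labels] for s in simple_doc]

    def trans(prev_j, j):
        if prev_j == -1 or j == -1:
            return 0.0
        return -jump_penalty * abs(j - prev_j - 1)  # 0 cost for a monotone step

    scores, back = [emit[0][:]], []
    for i in range(1, len(simple_doc)):
        row, ptr = [], []
        for jdx, j in enumerate(labels):
            k = max(range(len(labels)),
                    key=lambda p: scores[-1][p] + trans(labels[p], j))
            row.append(scores[-1][k] + trans(labels[k], j) + emit[i][jdx])
            ptr.append(k)
        scores.append(row)
        back.append(ptr)

    best = max(range(len(labels)), key=lambda k: scores[-1][k])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [labels[k] for k in reversed(path)]

simple = ["The cat sat on the mat.", "It was happy."]
complex_ = ["The cat, a tabby, sat on the mat.",
            "The weather was cold.", "It seemed quite happy."]
print(align(simple, complex_))  # -> [0, 2]
```

In the paper, the emission score comes from a neural sentence-pair model rather than lexical overlap, and the model is trained rather than hand-tuned; the sketch only conveys how sequential structure and pairwise semantic similarity combine in a CRF-style decoder.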
Related papers
- Learning to Paraphrase Sentences to Different Complexity Levels [3.0273878903284275]
Sentence simplification is an active research topic in NLP, but its adjacent tasks of sentence complexification and same-level paraphrasing are not.
To train models on all three tasks, we present two new unsupervised datasets.
arXiv Detail & Related papers (2023-08-04T09:43:37Z)
- Exploiting Summarization Data to Help Text Simplification [50.0624778757462]
We analyzed the similarity between text summarization and text simplification and exploited summarization data to help the simplification task.
We named these pairs Sum4Simp (S4S) and conducted human evaluations to show that S4S is high-quality.
arXiv Detail & Related papers (2023-02-14T15:32:04Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
- Neural semi-Markov CRF for Monolingual Word Alignment [20.897157172049877]
We present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans.
We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models.
arXiv Detail & Related papers (2021-06-04T16:04:00Z)
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Data-to-Text Generation with Iterative Text Editing [3.42658286826597]
We present a novel approach to data-to-text generation based on iterative text editing.
We first transform data items to text using trivial templates, and then we iteratively improve the resulting text by a neural model trained for the sentence fusion task.
The output of the model is filtered by a simple heuristic and reranked with an off-the-shelf pre-trained language model.
arXiv Detail & Related papers (2020-11-03T13:32:38Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.