MUSS: Multilingual Unsupervised Sentence Simplification by Mining
Paraphrases
- URL: http://arxiv.org/abs/2005.00352v2
- Date: Fri, 16 Apr 2021 15:08:50 GMT
- Title: MUSS: Multilingual Unsupervised Sentence Simplification by Mining
Paraphrases
- Authors: Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes,
Benoît Sagot
- Abstract summary: We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data.
MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data.
We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results.
- Score: 20.84836431084352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Progress in sentence simplification has been hindered by a lack of labeled
parallel simplification data, particularly in languages other than English. We
introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that
does not require labeled simplification data. MUSS uses a novel approach to
sentence simplification that trains strong models using sentence-level
paraphrase data instead of proper simplification data. These models leverage
unsupervised pretraining and controllable generation mechanisms to flexibly
adjust attributes such as length and lexical complexity at inference time. We
further present a method to mine such paraphrase data in any language from
Common Crawl using semantic sentence embeddings, thus removing the need for
labeled data. We evaluate our approach on English, French, and Spanish
simplification benchmarks and closely match or outperform the previous best
supervised results, despite not using any labeled simplification data. We push
the state of the art further by incorporating labeled simplification data.
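The mining and control mechanisms described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words embedder stands in for the semantic sentence encoder MUSS mines with, and the similarity thresholds and control-token names (`<NbChars_...>`, `<WordRank_...>`) are assumptions patterned on controllable-generation work, not values from the paper.

```python
import numpy as np

def bow_embed(sentences):
    # Toy bag-of-words embeddings; a stand-in for the multilingual
    # semantic sentence encoder used for mining in the paper.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    embs = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w in s.lower().split():
            embs[row, index[w]] += 1.0
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    return embs / np.where(norms == 0.0, 1.0, norms)

def mine_paraphrases(sentences, sim_lo=0.7, sim_hi=0.95):
    # Keep pairs that are semantically close (>= sim_lo) but not
    # near-duplicates (<= sim_hi); the thresholds are illustrative.
    embs = bow_embed(sentences)
    sims = embs @ embs.T  # cosine similarity: rows are unit-norm
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim_lo <= sims[i, j] <= sim_hi:
                pairs.append((sentences[i], sentences[j]))
    return pairs

def add_control_tokens(source, n_chars=0.8, word_rank=0.75):
    # Controllable generation: prepend tokens encoding the target
    # length and lexical-complexity ratios; token names hypothetical.
    return f"<NbChars_{n_chars}> <WordRank_{word_rank}> {source}"
```

At inference time, lowering the length or lexical-complexity ratio in the control tokens steers the paraphrase model toward shorter, simpler output.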
Related papers
- Evaluating Document Simplification: On the Importance of Separately Assessing Simplicity and Meaning Preservation [9.618393813409266]
This paper focuses on the evaluation of document-level text simplification.
We compare existing models using distinct metrics for meaning preservation and simplification.
We introduce a reference-less metric variant for simplicity, showing that models are mostly biased towards either simplification or meaning preservation.
arXiv Detail & Related papers (2024-04-04T08:04:24Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training [0.0]
We propose a two-step approach to overcome the data scarcity issue.
First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German.
We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts.
arXiv Detail & Related papers (2023-05-22T10:41:30Z)
- SASS: Data and Methods for Subject Aware Sentence Simplification [0.0]
This paper provides a dataset aimed at training models that perform subject-aware sentence simplification.
We also test models on that dataset that are inspired by architectures used in abstractive summarization.
arXiv Detail & Related papers (2023-03-26T00:02:25Z)
- Exploiting Summarization Data to Help Text Simplification [50.0624778757462]
We analyzed the similarity between text summarization and text simplification and exploited summarization data to aid simplification.
We named these pairs Sum4Simp (S4S) and conducted human evaluations to show that S4S is of high quality.
arXiv Detail & Related papers (2023-02-14T15:32:04Z) - Explain to me like I am five -- Sentence Simplification Using
Transformers [2.017876577978849]
Sentence simplification aims at making the structure of text easier to read and understand while maintaining its original meaning.
This can be helpful for people with disabilities, new language learners, or those with low literacy.
Previous research has focused on tackling this task either by using external linguistic databases for simplification or by using control tokens for desired fine-tuning of sentences.
We experiment with a combination of GPT-2 and BERT models, achieving the best SARI score of 46.80 on the Mechanical Turk dataset.
arXiv Detail & Related papers (2022-12-08T22:57:18Z) - Self-Training Sampling with Monolingual Data Uncertainty for Neural
Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
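The uncertainty score in the entry above can be sketched as a word-level translation-entropy measure. The lexicon format and the zero-entropy handling of out-of-vocabulary words below are simplifying assumptions for illustration, not details taken from the paper.

```python
import math

def word_entropy(translations):
    # Entropy of a word's translation distribution, taken from a
    # probabilistic bilingual dictionary (probabilities sum to 1).
    return -sum(p * math.log2(p) for p in translations.values() if p > 0)

def sentence_uncertainty(sentence, lexicon):
    # Average translation entropy over the words of a monolingual
    # sentence; out-of-lexicon words contribute zero here (an
    # assumption, not necessarily the paper's choice).
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(word_entropy(lexicon.get(w, {})) for w in words) / len(words)

def select_informative(sentences, lexicon, k):
    # Pick the k most uncertain monolingual sentences to complement
    # the parallel data during self-training.
    return sorted(sentences,
                  key=lambda s: sentence_uncertainty(s, lexicon),
                  reverse=True)[:k]
```

Words with many plausible translations (high entropy) mark sentences the model is least sure about, so sampling them first adds the most information.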
arXiv Detail & Related papers (2021-06-02T05:01:36Z) - Controllable Text Simplification with Explicit Paraphrasing [88.02804405275785]
Text Simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting.
Current simplification systems are predominantly sequence-to-sequence models that are trained end-to-end to perform all these operations simultaneously.
We propose a novel hybrid approach that leverages linguistically-motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles.
arXiv Detail & Related papers (2020-10-21T13:44:40Z) - ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification
Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z) - Semi-Supervised Models via Data Augmentationfor Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distributions uniform and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
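The self-training step in the last entry can be sketched as entropy-gated pseudo-labeling: only predictions the model is confident about (low entropy) become training labels. The entropy threshold below is an illustrative assumption, not a value from the paper.

```python
import math

def entropy(probs):
    # Shannon entropy (natural log) of a predicted class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def pseudo_label(unlabeled, model_probs, max_entropy=0.3):
    # Keep only low-entropy (confident) predictions over unlabeled
    # sentences and treat the argmax class as a pseudo label; the
    # threshold is illustrative.
    selected = []
    for text, probs in zip(unlabeled, model_probs):
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            selected.append((text, label))
    return selected
```

High-entropy predictions are discarded rather than labeled, so noisy guesses do not feed back into training.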
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.