Language Models for German Text Simplification: Overcoming Parallel Data
Scarcity through Style-specific Pre-training
- URL: http://arxiv.org/abs/2305.12908v1
- Date: Mon, 22 May 2023 10:41:30 GMT
- Title: Language Models for German Text Simplification: Overcoming Parallel Data
Scarcity through Style-specific Pre-training
- Authors: Miriam Anschütz, Joshua Oehms, Thomas Wimmer, Bartłomiej Jezierski, Georg Groh
- Abstract summary: We propose a two-step approach to overcome the parallel data scarcity issue.
First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German.
We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic text simplification systems help to reduce textual information
barriers on the internet. However, for languages other than English, only little
parallel data exists to train these systems. We propose a two-step approach to
overcome this data scarcity issue. First, we fine-tuned language models on a
corpus of German Easy Language, a specific style of German. Then, we used these
models as decoders in a sequence-to-sequence simplification task. We show that
the language models adapt to the style characteristics of Easy Language and
output more accessible texts. Moreover, with the style-specific pre-training,
we reduced the number of trainable parameters in text simplification models.
Hence, less parallel data is sufficient for training. Our results indicate that
pre-training on unaligned data can reduce the required parallel data while
improving the performance on downstream tasks.
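A minimal sketch of this two-step setup, assuming a Hugging Face Transformers stack; the checkpoint names, the toy Easy Language corpus, and the choice to freeze everything in the decoder except its cross-attention are illustrative assumptions, not details taken from the paper:
```python
# Sketch of the two-step approach, assuming Hugging Face Transformers.
# Checkpoint names, the toy corpus, and the freezing strategy are illustrative.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          EncoderDecoderModel)

# --- Step 1: style-specific pre-training on unaligned Easy Language text ---
tok = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
lm = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")
optim = torch.optim.AdamW(lm.parameters(), lr=5e-5)

easy_corpus = ["Das ist ein Haus.", "Der Hund ist gross."]  # stand-in for a real corpus
for text in easy_corpus:
    batch = tok(text, return_tensors="pt")
    loss = lm(**batch, labels=batch["input_ids"]).loss  # causal LM objective
    loss.backward()
    optim.step()
    optim.zero_grad()
lm.save_pretrained("easy-german-gpt2")
tok.save_pretrained("easy-german-gpt2")

# --- Step 2: reuse the style-adapted LM as the decoder of a seq2seq model ---
simplifier = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-german-cased", "easy-german-gpt2")

# Freeze the style-adapted decoder; only the newly initialised cross-attention
# layers (plus the encoder) remain trainable, so far fewer parameters have to
# be fitted on the small parallel simplification corpus.
for name, param in simplifier.decoder.named_parameters():
    param.requires_grad = "crossattention" in name
```
Because only the cross-attention (and, if desired, the encoder) is updated in step 2, the small parallel corpus has to fit far fewer parameters, which is the mechanism the abstract credits for needing less aligned data.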
Related papers
- German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data [0.7059555559002345]
This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts.
We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance.
arXiv Detail & Related papers (2024-02-16T13:28:44Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT pushes the SOTA neural machine translation performance across 15 translation tasks on 8 language pairs significantly higher.
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
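The BiT summary above leaves "bidirectionally update" abstract; one common reading is that the parallel data is temporarily doubled by swapping source and target sides during an early warm-up phase before training continues in the normal direction. A hedged sketch of that data construction (the warm-up fraction and function names are our assumptions, not taken from the paper):
```python
# Hedged sketch: bidirectional warm-up for NMT training data.
# The swap-based augmentation and the warm-up fraction are assumptions
# based on the summary above, not a verified re-implementation of BiT.
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def bidirectional(parallel: List[Pair]) -> List[Pair]:
    """Early stage: use src->tgt pairs plus their tgt->src mirrors."""
    return parallel + [(tgt, src) for src, tgt in parallel]

def schedule(parallel: List[Pair], total_steps: int,
             warmup: float = 0.3) -> Iterator[Tuple[int, List[Pair]]]:
    """Yield the pool of training pairs available at each step."""
    for step in range(total_steps):
        pool = bidirectional(parallel) if step < warmup * total_steps else parallel
        yield step, pool

if __name__ == "__main__":
    data = [("Das Haus ist gross.", "The house is big.")]
    for step, pool in schedule(data, total_steps=4):
        print(step, len(pool))  # pool sizes: 2, 2, 1, 1
```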
- Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data [2.225882303328135]
We propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parsing task.
Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems.
arXiv Detail & Related papers (2021-09-09T14:51:11Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
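A small illustration of the contrast drawn in the pretraining-objectives entry above, between masking and noise that keeps the input looking like a full sentence; the probabilities, the local-window shuffling, and the random replacement (standing in for context-based substitution) are assumptions for illustration only:
```python
# Illustrative noising objectives: span masking vs. perturbations that keep
# the input looking like a real sentence (local reordering, word replacement).
# Context-aware replacement (e.g. sampling substitutes from a masked LM) is
# stubbed out with a random vocabulary draw; that simplification is ours.
import random

def mask_words(tokens, p=0.35, mask="<mask>"):
    """Classic MLM-style corruption: some words are hidden behind a mask token."""
    return [mask if random.random() < p else t for t in tokens]

def shuffle_locally(tokens, window=3):
    """Reorder words within small windows so the output is still a full sentence."""
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        random.shuffle(chunk)
        out.extend(chunk)
    return out

def replace_words(tokens, vocab, p=0.15):
    """Replace some words; a masked LM would normally propose context-fitting substitutes."""
    return [random.choice(vocab) if random.random() < p else t for t in tokens]

if __name__ == "__main__":
    sent = "the quick brown fox jumps over the lazy dog".split()
    vocab = ["red", "cat", "walks", "small"]
    print(mask_words(sent))
    print(shuffle_locally(sent))
    print(replace_words(sent, vocab))
```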
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Token-wise Curriculum Learning for Neural Machine Translation [94.93133801641707]
Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from training data at the early training stage.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases [20.84836431084352]
We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data.
MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data.
We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results.
arXiv Detail & Related papers (2020-05-01T12:54:30Z)
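To make the mining idea in the MUSS entry above concrete, here is a hedged sketch of harvesting paraphrase pairs from a raw corpus as seq2seq training data; the bag-of-words similarity is a dependency-free stand-in for the multilingual sentence embeddings such systems actually use, and the thresholds are arbitrary:
```python
# Hedged sketch of mining sentence-level paraphrase pairs from a raw corpus.
# Real systems use multilingual sentence embeddings; the bag-of-words cosine
# similarity below is a stand-in so the example runs without extra
# dependencies, and the similarity thresholds are arbitrary.
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mine_paraphrases(sentences, lo=0.3, hi=0.99):
    """Keep pairs that are similar but not identical; these become seq2seq training pairs."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(vecs), 2):
        if lo <= cosine(a, b) <= hi:
            pairs.append((sentences[i], sentences[j]))
    return pairs

if __name__ == "__main__":
    corpus = [
        "The medication must be taken twice per day.",
        "You must take the medicine two times a day.",
        "The weather was sunny yesterday.",
    ]
    print(mine_paraphrases(corpus))  # only the first two sentences pair up
```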
- Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders [37.949101113934226]
Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics.
This work investigates how to leverage large amounts of unpaired corpora in TS task.
We propose asymmetric denoising methods for sentences with separate complexity.
arXiv Detail & Related papers (2020-04-30T11:19:04Z)
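A hedged sketch of what asymmetric denoising over unpaired corpora of different complexity can look like: each side gets its own corruption function before an autoencoder reconstructs the clean sentence. Which corruption is applied to which side, and with what strength, is our assumption rather than the paper's recipe.
```python
# Hedged sketch of asymmetric noise functions for unpaired simplification data.
# The simple-language and complex-language corpora are corrupted differently
# before reconstruction; the concrete noise per side is an assumption.
import random

def drop_words(tokens, p=0.2):
    kept = [t for t in tokens if random.random() >= p]
    return kept or tokens  # never return an empty sentence

def shuffle_slightly(tokens, max_shift=2):
    keys = [i + random.uniform(0, max_shift) for i in range(len(tokens))]
    return [t for _, t in sorted(zip(keys, tokens))]

def noise_simple(tokens):
    """Lighter noise for sentences drawn from the simple-language corpus."""
    return shuffle_slightly(drop_words(tokens, p=0.1))

def noise_complex(tokens):
    """Heavier noise for sentences drawn from the complex-language corpus."""
    return shuffle_slightly(drop_words(tokens, p=0.3), max_shift=3)

if __name__ == "__main__":
    simple = "Der Hund ist gross .".split()
    complex_ = "Der ausgewachsene Schäferhund erreicht eine beachtliche Grösse .".split()
    print(noise_simple(simple))     # reconstruct -> learn the simple style
    print(noise_complex(complex_))  # reconstruct -> learn the complex style
```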
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.