German4All -- A Dataset and Model for Readability-Controlled Paraphrasing in German
- URL: http://arxiv.org/abs/2508.17973v2
- Date: Fri, 29 Aug 2025 05:23:50 GMT
- Title: German4All -- A Dataset and Model for Readability-Controlled Paraphrasing in German
- Authors: Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh
- Abstract summary: We introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification.
- Score: 5.50777893297099
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models [5.713983191152314]
VTechAGP is the first academic-to-general-audience text paraphrase dataset. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels.
arXiv Detail & Related papers (2024-11-07T16:06:00Z)
- Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models [1.565361244756411]
This paper explores how large language models (LLMs) can be used to generate and evaluate reading comprehension items.
We developed a protocol for human and automatic evaluation, including a metric we call text informativity.
Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2.
arXiv Detail & Related papers (2024-04-11T13:11:21Z)
- German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data [0.7059555559002345]
This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts.
We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance.
arXiv Detail & Related papers (2024-02-16T13:28:44Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Automatic Readability Assessment of German Sentences with Transformer Ensembles [0.0]
We studied the ability of ensembles of fine-tuned GBERT and GPT-2-Wechsel models to reliably predict the readability of German sentences.
Mixed ensembles of GBERT and GPT-2-Wechsel performed better than ensembles of the same size consisting of only GBERT or GPT-2-Wechsel models.
arXiv Detail & Related papers (2022-09-09T13:47:55Z)
- Pseudo-Labels Are All You Need [3.52359746858894]
We present our submission to the Text Complexity DE Challenge 2022.
The goal is to predict the complexity of a German sentence for German learners at level B.
We find that the pseudo-label-based approach gives impressive results yet requires little to no adjustment to the specific task.
arXiv Detail & Related papers (2022-08-19T09:52:41Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)