German Text Simplification: Finetuning Large Language Models with
Semi-Synthetic Data
- URL: http://arxiv.org/abs/2402.10675v1
- Date: Fri, 16 Feb 2024 13:28:44 GMT
- Title: German Text Simplification: Finetuning Large Language Models with
Semi-Synthetic Data
- Authors: Lars Klöser, Mika Beele, Jan-Niklas Schagen, Bodo Kraft
- Abstract summary: This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts.
We finetune Large Language Models with up to 13 billion parameters on this data and evaluate their performance.
- Score: 0.7059555559002345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study pioneers the use of synthetically generated data for training
generative models in document-level text simplification of German texts. We
demonstrate the effectiveness of our approach with real-world online texts.
Addressing the challenge of data scarcity in language simplification, we
crawled professionally simplified German texts and synthesized a corpus using
GPT-4. We finetune Large Language Models with up to 13 billion parameters on
this data and evaluate their performance. This paper employs various
methodologies for evaluation and demonstrates the limitations of currently used
rule-based metrics. Both automatic and manual evaluations reveal that our
models can significantly simplify real-world online texts, indicating the
potential of synthetic data in improving text simplification.
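The abstract points to the limitations of rule-based metrics; a typical example of such a metric in German simplification work is Amstad's adaptation of the Flesch Reading Ease, which depends only on average sentence length (ASL) and average syllables per word (ASW): score = 180 - ASL - 58.5 * ASW. The following is a minimal illustrative sketch in Python, not code from the paper; the vowel-group syllable counter is a rough assumption.

    import re

    GERMAN_VOWELS = "aeiouyäöü"

    def count_syllables(word: str) -> int:
        # Rough heuristic: one syllable per group of consecutive vowels.
        return max(1, len(re.findall(f"[{GERMAN_VOWELS}]+", word.lower())))

    def amstad_readability(text: str) -> float:
        # Flesch Reading Ease adapted for German (Amstad): 180 - ASL - 58.5 * ASW.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
        asl = len(words) / max(1, len(sentences))                          # average sentence length
        asw = sum(count_syllables(w) for w in words) / max(1, len(words))  # average syllables per word
        return 180.0 - asl - 58.5 * asw

    # Higher scores indicate easier text.
    print(amstad_readability("Das ist ein kurzer Satz. Er ist leicht zu lesen."))

Because such scores depend only on sentence and word lengths, they can improve even when meaning is lost (for example by truncating sentences), which helps explain why the paper complements automatic metrics with manual evaluation.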
Related papers
- Instruction Data Generation and Unsupervised Adaptation for Speech Language Models [21.56355461403427]
We propose three methods for generating synthetic samples to train and evaluate multimodal large language models.
Synthetic data generation emerges as a crucial strategy to enhance the performance of such systems.
We highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions.
arXiv Detail & Related papers (2024-06-18T08:27:00Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - A Novel Dataset for Financial Education Text Simplification in Spanish [4.475176409401273]
In Spanish, there are few datasets that can be used to create text simplification systems.
We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules.
arXiv Detail & Related papers (2023-12-15T15:47:08Z) - Language Models for German Text Simplification: Overcoming Parallel Data
Scarcity through Style-specific Pre-training [0.0]
We propose a two-step approach to overcome the data scarcity issue.
First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German.
We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts.
arXiv Detail & Related papers (2023-05-22T10:41:30Z) - A Transfer Learning Based Model for Text Readability Assessment in
German [4.550811027560416]
We propose a new model for text complexity assessment of German text based on transfer learning.
The best model, based on the BERT pre-trained language model, achieved a Root Mean Square Error (RMSE) of 0.483.
arXiv Detail & Related papers (2022-07-13T15:15:44Z) - How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z) - Fine-tuning GPT-3 for Russian Text Summarization [77.34726150561087]
This paper showcases ruGPT3's ability to summarize texts, fine-tuning it on corpora of Russian news with their corresponding human-generated summaries.
We evaluate the resulting texts with a set of metrics, showing that our solution can surpass the state-of-the-art model's performance without additional changes in architecture or loss function.
arXiv Detail & Related papers (2021-08-07T19:01:40Z) - GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation [9.501648136713694]
Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts.
This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples; a generic sketch of this prompt-based generation idea appears after this list.
arXiv Detail & Related papers (2021-04-18T11:39:33Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z) - Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators.
It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus.
We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
arXiv Detail & Related papers (2020-06-28T21:23:05Z)
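Several of the entries above (GPT3Mix in particular) and the main paper share one idea: prompting a large language model to generate training samples that are otherwise scarce. The sketch below illustrates that idea for the simplification setting. It is a generic illustration, not the prompt or pipeline from any of the papers; the prompt wording, the call_llm stub, and the example texts are assumptions.

    def build_simplification_prompt(source_text: str) -> str:
        # Instruction prompt asking a model to rewrite a German text in simple language.
        return (
            "Vereinfache den folgenden deutschen Text. "
            "Verwende kurze Sätze und einfache Wörter.\n\n"
            f"Text:\n{source_text}\n\nVereinfachter Text:"
        )

    def call_llm(prompt: str) -> str:
        # Stub standing in for a completion API (for example a GPT-4 request);
        # it returns a canned answer here so the sketch runs without credentials.
        return "Die Stadt baut eine neue Schule. Die Schule öffnet nächstes Jahr."

    source = ("Die Stadtverwaltung kündigte an, dass der Neubau der Grundschule "
              "voraussichtlich im kommenden Jahr fertiggestellt und eröffnet wird.")
    simplified = call_llm(build_simplification_prompt(source))

    # Each (source, simplified) pair becomes one semi-synthetic training example
    # that a smaller model can later be finetuned on.
    training_example = {"input": source, "output": simplified}
    print(training_example)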
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.