DEPLAIN: A German Parallel Corpus with Intralingual Translations into
Plain Language for Sentence and Document Simplification
- URL: http://arxiv.org/abs/2305.18939v1
- Date: Tue, 30 May 2023 11:07:46 GMT
- Authors: Regina Stodden and Omar Momen and Laura Kallmeyer
- Abstract summary: This paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German.
We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results.
We make the corpus, the adapted alignment methods for German, the web harvester, and the trained models publicly available.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text simplification is an intralingual translation task in which documents
or sentences of a complex source text are simplified for a target audience. The
success of automatic text simplification systems is highly dependent on the
quality of parallel data used for training and evaluation. To advance sentence
simplification and document simplification in German, this paper presents
DEplain, a new dataset of parallel, professionally written and manually aligned
simplifications in plain German ("plain DE" or in German: "Einfache Sprache").
DEplain consists of a news-domain corpus (approx. 500 document pairs, approx. 13k
sentence pairs) and a web-domain corpus (approx. 150 aligned documents, approx.
2k aligned sentence pairs). In addition, we are building a web harvester and
experimenting with automatic alignment methods to facilitate the integration of
non-aligned and not-yet-published parallel documents. Using this approach, we are
dynamically growing the web-domain corpus, which has currently been extended to
approx. 750 document pairs and approx. 3.5k aligned sentence pairs. We show
that using DEplain to train a transformer-based seq2seq text simplification
model can achieve promising results. We make available the corpus, the adapted
alignment methods for German, the web harvester and the trained models here:
https://github.com/rstodden/DEPlain.
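To illustrate what an automatic sentence-alignment method does, the toy baseline below pairs complex and plain sentences by token overlap. This is a deliberately simple sketch of the general idea, not the adapted alignment method released with DEplain; all function names and thresholds are ours.

```python
# Toy sentence alignment between a complex document and its plain-language
# version: score every sentence pair by token-overlap (Jaccard) similarity
# and keep each complex sentence's best match above a threshold.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def align_sentences(complex_sents, plain_sents, threshold=0.3):
    """Greedy 1:1 alignment: pair each complex sentence with its most
    similar plain sentence if the score clears the threshold."""
    pairs = []
    for i, c in enumerate(complex_sents):
        best_score, best_j = max((jaccard(c, p), j) for j, p in enumerate(plain_sents))
        if best_score >= threshold:
            pairs.append((i, best_j, round(best_score, 2)))
    return pairs

complex_doc = [
    "The municipality announced comprehensive renovations of the station.",
    "Commuters should anticipate substantial delays during construction.",
]
plain_doc = [
    "The city will fix the station.",
    "Commuters should expect delays during construction.",
]
print(align_sentences(complex_doc, plain_doc))  # one confident pair: complex[1] <-> plain[1]
```

Real aligners replace the overlap score with stronger similarity signals (e.g. cross-lingual or monolingual sentence embeddings) and allow 1:n and n:1 alignments, which greedy 1:1 matching cannot produce.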
Related papers
- SentAlign: Accurate and Scalable Sentence Alignment
SentAlign is an accurate sentence alignment tool designed to handle very large parallel document pairs.
The alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences.
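The idea of evaluating alignment paths can be sketched as a small dynamic program over a sentence-similarity matrix. This is a generic monotonic-alignment toy (1:1 alignments plus skips), not SentAlign's actual algorithm, scoring function, or divide-and-conquer machinery:

```python
# Dynamic programming over a similarity matrix: find the highest-scoring
# monotonic path of align/skip moves, as in classic sentence aligners.

def best_monotonic_path(sim):
    """sim[i][j] = similarity of source sentence i and target sentence j.
    Returns the best path's 1:1 aligned pairs and its total score."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            moves = []
            if i < n and j < m:          # align source i with target j
                moves.append((i + 1, j + 1, sim[i][j]))
            if i < n:                    # skip a source sentence
                moves.append((i + 1, j, 0.0))
            if j < m:                    # skip a target sentence
                moves.append((i, j + 1, 0.0))
            for ni, nj, gain in moves:
                if score[i][j] + gain > score[ni][nj]:
                    score[ni][nj] = score[i][j] + gain
                    back[ni][nj] = (i, j)
    # walk the backpointers, collecting the diagonal (align) moves
    pairs, cell = [], (n, m)
    while back[cell[0]][cell[1]] is not None:
        pi, pj = back[cell[0]][cell[1]]
        if cell == (pi + 1, pj + 1):
            pairs.append((pi, pj))
        cell = (pi, pj)
    return list(reversed(pairs)), score[n][m]

sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.0, 0.8],
]
print(best_monotonic_path(sim))  # best path aligns (0, 0) and (1, 2)
```

Full DP is quadratic in document length, which is why tools handling documents of tens of thousands of sentences split the problem into smaller subproblems first.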
arXiv Detail & Related papers (2023-11-15T14:15:41Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
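The retrieval step behind this idea can be illustrated with exact cosine similarity over bag-of-words vectors. The paper relies on efficient approximate nearest neighbor search over learned embeddings; this exact, term-count version is only a stand-in to show how related documents get ranked:

```python
# Rank documents by cosine similarity to a query document, so related
# documents can be grouped into one pretraining sequence. Exact
# bag-of-words similarity here; real systems use approximate nearest
# neighbor search over dense embeddings.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_documents(query: str, docs: list, k: int = 2):
    """Indices of the k documents most similar to the query text."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(d.lower().split())), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

docs = [
    "stock markets fell sharply on friday",
    "the recipe calls for two eggs and flour",
    "markets rallied after the bank cut rates",
]
print(nearest_documents("bank stock markets", docs, k=2))  # the two finance documents
```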
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training
We propose a two-step approach to overcome the data scarcity issue.
First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German.
We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts.
arXiv Detail & Related papers (2023-05-22T10:41:30Z)
- Dual-Alignment Pre-training for Cross-lingual Sentence Embedding
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach can significantly improve sentence embeddings.
arXiv Detail & Related papers (2023-05-16T03:53:30Z)
- A New Aligned Simple German Corpus
We present a new sentence-aligned monolingual corpus for Simple German -- German.
It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment methods.
The quality of our sentence alignments, as measured by F1-score, surpasses previous work.
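Measuring alignment quality by F1-score, as above, amounts to comparing predicted sentence pairs against gold pairs. A minimal, generic helper (the function name and example pairs are ours, not from the paper):

```python
# F1 of predicted sentence alignments against a gold-standard alignment,
# treating each (source_index, target_index) pair as one prediction.

def alignment_f1(predicted, gold):
    """predicted, gold: iterables of (src_idx, tgt_idx) pairs."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                 # pairs found in both
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 0), (1, 1), (2, 1)]
pred = [(0, 0), (1, 1), (2, 2)]
print(round(alignment_f1(pred, gold), 3))  # 2 of 3 pairs correct
```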
arXiv Detail & Related papers (2022-09-02T15:14:04Z)
- Klexikon: A German Dataset for Joint Summarization and Simplification
We create a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children's lexicon "Klexikon".
We highlight the summarization aspect and provide statistical evidence that this resource is well suited to simplification as well.
arXiv Detail & Related papers (2022-01-18T18:50:43Z)
- Document-Level Text Simplification: Dataset, Criteria and Baseline
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
- Neural CRF Model for Sentence Alignment in Text Simplification
We create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia.
Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1.
A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.
arXiv Detail & Related papers (2020-05-05T16:47:51Z)
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
- Extractive Summarization as Text Matching
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
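For reference, ROUGE-1 (as in the 44.41 figure above) measures unigram overlap between a system output and a reference. A simplified sketch of the F1 variant, without the official toolkit's stemming and preprocessing, so its scores will not match published numbers:

```python
# Simplified ROUGE-1 F1: clipped unigram matches between a candidate
# text and a single reference, combined as harmonic mean of precision
# and recall.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())     # per-word min counts
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 3))
```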
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.