A New Aligned Simple German Corpus
- URL: http://arxiv.org/abs/2209.01106v4
- Date: Fri, 26 May 2023 16:11:23 GMT
- Title: A New Aligned Simple German Corpus
- Authors: Vanessa Toborek, Moritz Busch, Malte Boßert, Christian Bauckhage and Pascal Welke
- Abstract summary: We present a new sentence-aligned monolingual corpus for Simple German -- German.
It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment methods.
The quality of our sentence alignments, as measured by F1-score, surpasses previous work.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: "Leichte Sprache", the German counterpart to Simple English, is a regulated
language aiming to facilitate complex written language that would otherwise
stay inaccessible to different groups of people. We present a new
sentence-aligned monolingual corpus for Simple German -- German. It contains
multiple document-aligned sources which we have aligned using automatic
sentence-alignment methods. We evaluate our alignments based on a manually
labelled subset of aligned documents. The quality of our sentence alignments,
as measured by F1-score, surpasses previous work. We publish the dataset under
CC BY-SA and the accompanying code under MIT license.
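The F1-score used above to evaluate sentence alignments can be sketched as follows, treating each alignment as a (simple, standard) sentence-index pair and comparing predicted pairs against a manually labelled gold set. The example pairs are invented for illustration and are not taken from the corpus.

```python
# Minimal sketch: F1 over sentence-alignment pairs (assumed illustrative data).

def alignment_f1(predicted, gold):
    """Return (precision, recall, F1) for two sets of alignment index pairs."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: gold has four aligned pairs, the system predicts three.
gold = {(0, 0), (1, 1), (1, 2), (2, 3)}
pred = {(0, 0), (1, 1), (2, 2)}
p, r, f = alignment_f1(pred, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # prints 0.67 0.5 0.57
```

Counting matched pairs rather than matched sentences penalises both spurious and missing alignments, which is why F1 is the usual summary score for this task.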
Related papers
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to decompose the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- SentAlign: Accurate and Scalable Sentence Alignment [4.363828136730248]
SentAlign is an accurate sentence alignment tool designed to handle very large parallel document pairs.
The alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences.
arXiv Detail & Related papers (2023-11-15T14:15:41Z)
- Does mBERT understand Romansh? Evaluating word embeddings using word alignment [0.0]
We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh.
Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align.
We additionally present a gold standard for German-Romansh word alignment.
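The alignment error rate (AER) reported in this entry is the standard metric for word alignment against a gold standard with sure links S and possible links P (S ⊆ P). A minimal sketch, with all link sets invented for illustration:

```python
# Minimal sketch: alignment error rate (AER) for word alignment.
# AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A are predicted links,
# S are sure gold links, and P are possible gold links with S ⊆ P.

def alignment_error_rate(predicted, sure, possible):
    """Return the AER of predicted links against sure/possible gold links."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # enforce the convention that S is a subset of P
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Invented example: two sure links, one extra possible link, three predictions.
sure = {(0, 0), (1, 1)}
possible = {(0, 0), (1, 1), (1, 2)}
pred = {(0, 0), (1, 2), (2, 2)}
print(round(alignment_error_rate(pred, sure, possible), 2))  # prints 0.4
```

Lower is better: an AER of 0.22, as cited above for the mBERT-based models, means roughly a fifth of the weighted links are in error.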
arXiv Detail & Related papers (2023-06-14T19:00:12Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
Finally, we explore whether large language models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification [1.5223905439199599]
This paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German.
We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results.
We make the corpus, the adapted alignment methods for German, the web harvester and the trained models publicly available.
arXiv Detail & Related papers (2023-05-30T11:07:46Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We also evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Subword Sampling for Low Resource Word Alignment [4.663577299263155]
We propose subword sampling-based alignment of text units.
We show that the subword sampling method consistently outperforms word-level alignment on six language pairs.
arXiv Detail & Related papers (2020-12-21T19:47:04Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT [22.701728185474195]
We first formalize a word alignment problem as a collection of independent predictions from a token in the source sentence to a span in the target sentence.
We then solve this problem using multilingual BERT, fine-tuned on manually created gold word alignment data.
We show that the proposed method significantly outperformed previous supervised and unsupervised word alignment methods without using any bitexts for pretraining.
arXiv Detail & Related papers (2020-04-29T23:40:08Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.