20min-XD: A Comparable Corpus of Swiss News Articles
- URL: http://arxiv.org/abs/2504.21677v1
- Date: Wed, 30 Apr 2025 14:16:08 GMT
- Title: 20min-XD: A Comparable Corpus of Swiss News Articles
- Authors: Michelle Wastl, Jannis Vamvas, Selena Calleri, Rico Sennrich
- Abstract summary: We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles.
- Score: 42.49142747741821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions and code for the described experiments.
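The abstract states that article pairs are "automatically aligned based on semantic similarity" but does not spell out the procedure here. As a rough illustration only, the sketch below assumes each article has already been mapped to an embedding vector by some multilingual encoder, and pairs French and German articles greedily by cosine similarity above a cutoff. The function names, the greedy one-to-one matching, and the `threshold` value are assumptions made for this example, not the authors' documented method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_documents(fr_vecs, de_vecs, threshold=0.5):
    """Greedy one-to-one alignment of French and German article embeddings.

    Hypothetical helper: returns (fr_index, de_index, score) triples,
    highest-scoring pairs first, keeping only pairs with score >= threshold.
    """
    candidates = [
        (cosine(f, d), i, j)
        for i, f in enumerate(fr_vecs)
        for j, d in enumerate(de_vecs)
    ]
    candidates.sort(reverse=True)  # best-scoring candidate pairs first

    used_fr, used_de, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining candidates score lower still
        if i in used_fr or j in used_de:
            continue  # each article is used in at most one pair
        pairs.append((i, j, score))
        used_fr.add(i)
        used_de.add(j)
    return pairs


# Toy embeddings standing in for real encoder output (assumed values).
fr_vecs = [[1.0, 0.0], [0.0, 1.0]]
de_vecs = [[0.0, 2.0], [3.0, 0.0]]
pairs = align_documents(fr_vecs, de_vecs)
```

Lowering `threshold` in such a scheme would admit more loosely related pairs, while raising it would keep mostly near-translations, which mirrors the range of similarity the abstract describes.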
Related papers
- The 2021 Tokyo Olympics Multilingual News Article Dataset [0.9749638953163389]
A total of 10,940 news articles were gathered from 1,918 different publishers covering 1,350 sub-events of the 2021 Olympics. These articles are written in nine languages from different language families and in different scripts. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms.
arXiv Detail & Related papers (2025-02-10T16:38:03Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- Shuffle & Divide: Contrastive Learning for Long Text [6.187839874846451]
We propose a self-supervised learning method for long text documents based on contrastive learning.
A key to our method is Shuffle and Divide (SaD), a simple text augmentation algorithm.
We have empirically evaluated our method by performing unsupervised text classification on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport datasets.
arXiv Detail & Related papers (2023-04-19T02:02:29Z)
- LANS: Large-scale Arabic News Summarization Corpus [20.835296945483275]
We build LANS, a large-scale and diverse dataset for the Arabic text summarization task.
LANS offers 8.4 million articles and their summaries, extracted from the metadata of newspaper websites between 1999 and 2019.
arXiv Detail & Related papers (2022-10-24T20:54:01Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)
- Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [8.395430195053061]
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
arXiv Detail & Related papers (2020-01-31T05:14:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.