Related papers: The 2021 Tokyo Olympics Multilingual News Article Dataset

URL: http://arxiv.org/abs/2502.06648v2
Date: Thu, 13 Feb 2025 20:46:57 GMT
Title: The 2021 Tokyo Olympics Multilingual News Article Dataset
Authors: Erik Novak, Erik Calcina, Dunja Mladenić, Marko Grobelnik
Abstract summary: A total of 10,940 news articles were gathered from 1,918 different publishers covering 1,350 sub-events of the 2021 Olympics. These articles are written in nine languages from different language families and in different scripts. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms.
Score: 0.9749638953163389
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce a dataset of multilingual news articles covering the 2021 Tokyo Olympics. A total of 10,940 news articles were gathered from 1,918 different publishers, covering 1,350 sub-events of the 2021 Olympics, and published between July 1, 2021, and August 14, 2021. These articles are written in nine languages from different language families and in different scripts. To create the dataset, the raw news articles were first retrieved via a service that collects and analyzes news articles. Then, the articles were grouped using an online clustering algorithm, with each group containing articles reporting on the same sub-event. Finally, the groups were manually annotated and evaluated. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms, for which limited datasets are available. It can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics from different perspectives. The dataset is available in CSV format and can be accessed from the CLARIN.SI repository.
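The abstract mentions that articles were grouped with an online clustering algorithm but does not say which one. As a rough illustration of the general idea only, the sketch below processes texts in arrival order and assigns each to the most similar existing cluster, opening a new cluster when none is similar enough. The whitespace tokenization, the cosine similarity over term counts, the 0.3 threshold, and the function names are all assumptions made for this example, not the authors' method. A multilingual setting would also need a language-independent representation, which this sketch does not attempt.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def online_cluster(texts, threshold=0.3):
    """Assign each text, in arrival order, to the most similar cluster.

    Returns one cluster index per input text. A new cluster is opened when
    no existing cluster reaches the (hypothetical) similarity threshold.
    """
    centroids = []  # one summed term-count vector per cluster
    labels = []
    for text in texts:
        vector = Counter(text.lower().split())
        best_index, best_score = -1, 0.0
        for index, centroid in enumerate(centroids):
            score = cosine(vector, centroid)
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score >= threshold:
            centroids[best_index].update(vector)  # fold the text into the cluster
            labels.append(best_index)
        else:
            centroids.append(vector)  # start a new cluster with this text
            labels.append(len(centroids) - 1)
    return labels
```

Because each text is seen once and compared only against cluster summaries, this style of clustering can run over a stream of incoming articles without revisiting earlier ones. The resulting groups would still need the kind of manual annotation and evaluation the abstract describes.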

Related papers

20min-XD: A Comparable Corpus of Swiss News Articles [42.49142747741821]
We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles.
arXiv Detail & Related papers (2025-04-30T14:16:08Z)
A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide. It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages. We leverage parallel translations of the Bible to construct such a dataset. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
MN-DS: A Multilabeled News Dataset for News Articles Hierarchical Classification [0.0]
This article presents a dataset of 10,917 news articles with hierarchical news categories collected between 1 January 2019 and 31 December 2019. We manually labeled the articles based on a hierarchical taxonomy with 17 first-level and 109 second-level categories. This dataset can be used to train machine learning models for automatically classifying news articles by topic.
arXiv Detail & Related papers (2022-12-22T22:27:26Z)
NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing) [89.77347919191774]
NewsEdits is the first publicly available dataset of news article revision histories. It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2021-04-19T21:15:30Z)
Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to the WMT20 shared news translation task. We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English. We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
A System for Worldwide COVID-19 Information Aggregation [92.60866520230803]
We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic classifier trained on our article-topic pair dataset helps users find information of interest efficiently.
arXiv Detail & Related papers (2020-07-28T01:33:54Z)
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus [3.3072037841206354]
We construct an English-Thai machine translation dataset with over 1 million segment pairs. We train machine translation models based on this dataset. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
arXiv Detail & Related papers (2020-07-07T15:14:32Z)
A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain. We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.