Multilingual Open Text 1.0: Public Domain News in 44 Languages
- URL: http://arxiv.org/abs/2201.05609v1
- Date: Fri, 14 Jan 2022 18:58:17 GMT
- Title: Multilingual Open Text 1.0: Public Domain News in 44 Languages
- Authors: Chester Palen-Michel, June Kim, Constantine Lignos
- Abstract summary: The first release of the corpus contains over 2.7 million news articles and 1 million shorter passages published between 2001 and 2021. The source material is in the public domain, our collection is licensed under a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License.
- Score: 2.642698101441705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new multilingual corpus containing text in 44 languages, many of
which have relatively few existing resources for natural language processing.
The first release of the corpus contains over 2.7 million news articles and 1
million shorter passages published between 2001 and 2021, collected from Voice of
America news websites. We describe our process for collecting, filtering, and
processing the data. The source material is in the public domain, our
collection is licensed under a Creative Commons license (CC BY 4.0), and all
software used to create the corpus is released under the MIT License. The
corpus will be regularly updated as additional documents are published.
Related papers
- GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages [53.56700754408902]
GlotCC is a clean, document-level, 2TB general domain corpus derived from CommonCrawl.
We make GlotCC and the system used to generate it available to the research community.
arXiv Detail & Related papers (2024-10-31T11:14:12Z)
- Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z)
- A New Massive Multilingual Dataset for High-Performance Language Technologies [14.375854322321997]
The HPLT language resources are a new massive multilingual dataset including both monolingual and bilingual corpora.
Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of 5.6 trillion word tokens de-duplicated on the document level.
Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens.
arXiv Detail & Related papers (2024-03-20T22:14:39Z)
- MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Its main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- LR-Sum: Summarization for Less-Resourced Languages [12.605915166622818]
This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset.
LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced.
The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0).
arXiv Detail & Related papers (2022-12-19T18:00:09Z)
- NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing) [89.77347919191774]
NewsEdits is the first publicly available dataset of news article revision histories.
It contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2021-04-19T21:15:30Z)
- The Multilingual TEDx Corpus for Speech Recognition and Translation [30.993199499048824]
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages.
The corpus is a collection of audio recordings from TEDx talks in 8 source languages.
We segment transcripts into sentences and align them to the source-language audio and target-language translations.
arXiv Detail & Related papers (2021-02-02T21:16:25Z)
- CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
This represents the largest open dataset available to date in terms of total volume and language coverage.
arXiv Detail & Related papers (2020-07-20T17:53:35Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.