MegaWika: Millions of reports and their sources across 50 diverse
languages
- URL: http://arxiv.org/abs/2307.07049v1
- Date: Thu, 13 Jul 2023 20:04:02 GMT
- Title: MegaWika: Millions of reports and their sources across 50 diverse
languages
- Authors: Samuel Barham and Orion Weller and Michelle Yuan and Kenton Murray and
Mahsa Yarmohammadi and Zhengping Jiang and Siddharth Vashishtha and Alexander
Martin and Anqi Liu and Aaron Steven White and Jordan Boyd-Graber and
Benjamin Van Durme
- Abstract summary: MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
- Score: 74.3909725023673
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: To foster the development of new models for collaborative AI-assisted report
generation, we introduce MegaWika, consisting of 13 million Wikipedia articles
in 50 diverse languages, along with their 71 million referenced source
materials. We process this dataset for a myriad of applications, going beyond
the initial Wikipedia citation extraction and web scraping of content,
including translating non-English articles for cross-lingual applications and
providing FrameNet parses for automated semantic analysis. MegaWika is the
largest resource for sentence-level report generation and the only report
generation dataset that is multilingual. We manually analyze the quality of
this resource through a semantically stratified sample. Finally, we provide
baseline results and trained models for crucial steps in automated report
generation: cross-lingual question answering and citation retrieval.
Related papers
- A New Massive Multilingual Dataset for High-Performance Language Technologies [14.375854322321997]
The HPLT language resources are a new massive multilingual dataset including both monolingual and bilingual corpora.
Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of 5.6 trillion word tokens de-duplicated on the document level.
Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens.
arXiv Detail & Related papers (2024-03-20T22:14:39Z) - XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation
in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - The Tatoeba Translation Challenge -- Realistic Data Sets for Low
Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset.
It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.