Mapping Languages: The Corpus of Global Language Use
- URL: http://arxiv.org/abs/2004.00798v1
- Date: Thu, 2 Apr 2020 03:42:14 GMT
- Title: Mapping Languages: The Corpus of Global Language Use
- Authors: Jonathan Dunn
- Abstract summary: This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes a web-based corpus of global language use with a focus
on how this corpus can be used for data-driven language mapping. First, the
corpus provides a representation of where national varieties of major languages
are used (e.g., English, Arabic, Russian) together with consistently collected
data for each variety. Second, the paper evaluates a language identification
model that supports more local languages with smaller sample sizes than
alternative off-the-shelf models. Improved language identification is essential
for moving beyond majority languages. Given the focus on language mapping, the
paper analyzes how well this digital language data represents actual
populations by (i) systematically comparing the corpus with demographic
ground-truth data and (ii) triangulating the corpus with an alternate
Twitter-based dataset. In total, the corpus contains 423 billion words
representing 148 languages (with over 1 million words from each language) and
158 countries (again with over 1 million words from each country), all
distilled from Common Crawl web data. The main contribution of this paper, in
addition to describing this publicly-available corpus, is to provide a
comprehensive analysis of the relationship between two sources of digital data
(the web and Twitter) as well as their connection to underlying populations.
Related papers
- Validating and Exploring Large Geographic Corpora [0.76146285961466]
Three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English.
The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations.
arXiv Detail & Related papers (2024-03-13T02:46:17Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Corpus Similarity Measures Remain Robust Across Diverse Languages [0.0]
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.
The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora.
Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology.
arXiv Detail & Related papers (2022-06-09T08:17:16Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.