Validating and Exploring Large Geographic Corpora
- URL: http://arxiv.org/abs/2403.08198v1
- Date: Wed, 13 Mar 2024 02:46:17 GMT
- Title: Validating and Exploring Large Geographic Corpora
- Authors: Jonathan Dunn
- Abstract summary: Three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English.
The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the impact of corpus creation decisions on large
multi-lingual geographic web corpora. Beginning with a 427 billion word corpus
derived from the Common Crawl, three methods are used to improve the quality of
sub-corpora representing specific language-country pairs like New Zealand
English: (i) the agreement of independent language identification systems, (ii)
hash-based deduplication, and (iii) location-specific outlier detection. The
impact of each of these steps is then evaluated at the language level and the
country level by using corpus similarity measures to compare each resulting
corpus with baseline data sets. The goal is to understand the impact of
upstream data cleaning decisions on downstream corpora with a specific focus on
under-represented languages and populations. The evaluation shows that the
validity of sub-corpora is improved with each stage of cleaning but that this
improvement is unevenly distributed across languages and populations. This
result shows how standard corpus creation techniques can accidentally exclude
under-represented populations.
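The three cleaning steps are only named in the abstract; a minimal sketch of steps (i) and (ii), assuming hypothetical language-identification callables and MD5 hashing (the abstract specifies neither), might look like this:
```python
import hashlib

def agreed_language(doc, identifiers):
    """Step (i): keep a document only when independent language
    identification systems agree on its language. `identifiers` is a
    list of callables returning a language code; these are hypothetical
    stand-ins, not the systems used in the paper."""
    labels = {identify(doc) for identify in identifiers}
    return labels.pop() if len(labels) == 1 else None

def deduplicate(docs):
    """Step (ii): hash-based deduplication, keeping the first
    occurrence of each document. MD5 over whitespace-normalized text
    is an assumption; the paper does not name its hash function here."""
    seen = set()
    for doc in docs:
        digest = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

# Step (iii), location-specific outlier detection, would follow here; it
# depends on per-country corpus similarity profiles omitted from this sketch.
```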
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, performance drops by more than 25% on redacted queries compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
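The sanitization evaluated above can be approximated with pattern-based redaction; a minimal sketch, assuming regex-detectable emails and numbers stand in for the paper's actual redaction pipeline:
```python
import re

# Hypothetical patterns standing in for a real PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "NUMBER": re.compile(r"\b\d[\d,.-]*\b"),
}

def redact(text):
    """Replace every detected span with a placeholder token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com or call 555-0100."))
# -> Contact [EMAIL] or call [NUMBER].
```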
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
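The strongest predictor reported above, the gap in subword split rates between source and target data, is straightforward to compute; a minimal sketch using a HuggingFace tokenizer (an assumption; the paper's PLMs and tokenizers may differ):
```python
from transformers import AutoTokenizer  # assumed dependency

def split_ratio(words, tokenizer):
    """Fraction of words the tokenizer splits into two or more subwords."""
    return sum(len(tokenizer.tokenize(w)) > 1 for w in words) / len(words)

# Hypothetical source/target samples; mBERT is an assumption, not
# necessarily one of the models finetuned in the paper.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
source = "the cat sat on the mat".split()
target = "de kat zat op de mat".split()
gap = abs(split_ratio(source, tok) - split_ratio(target, tok))
# A smaller gap predicts better zero-shot performance on the target variety.
```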
- Corpus Similarity Measures Remain Robust Across Diverse Languages [0.0]
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.
The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora.
Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology.
arXiv Detail & Related papers (2022-06-09T08:17:16Z)
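Frequency-based measures of this kind are commonly implemented as the Spearman correlation between word frequencies over a shared top-n vocabulary; a minimal sketch under that assumption (the paper evaluates several such measures):
```python
from collections import Counter
from scipy.stats import spearmanr  # assumed dependency

def corpus_similarity(corpus_a, corpus_b, n=1000):
    """Spearman correlation of word frequencies over the top-n words of
    the combined vocabulary; values near 1 indicate similar corpora."""
    freq_a = Counter(w for doc in corpus_a for w in doc.split())
    freq_b = Counter(w for doc in corpus_b for w in doc.split())
    vocab = [w for w, _ in (freq_a + freq_b).most_common(n)]
    rho, _ = spearmanr([freq_a[w] for w in vocab],
                       [freq_b[w] for w in vocab])
    return rho
```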
"Extract and Generate" (EAG) is a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
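The extraction step can be illustrated by joining two bilingual corpora on their shared-language side; a minimal sketch, assuming exact matches stand in for the paper's pairing of highly similar sentences:
```python
def extract_candidates(en_fr, en_de):
    """Pair (en, fr) and (en, de) examples that share an English side,
    yielding candidate (en, fr, de) multi-way examples. Exact matching
    is a simplification; the paper pairs highly similar, not only
    identical, sentences, and follows with a generation model."""
    fr_by_en = {en: fr for en, fr in en_fr}
    for en, de in en_de:
        if en in fr_by_en:
            yield en, fr_by_en[en], de
```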
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
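Bitext retrieval as an alignment measure reduces to nearest-neighbour search over paired sentence embeddings; a minimal numpy sketch, assuming pre-computed embeddings where row i of each matrix is a translation pair:
```python
import numpy as np

def retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest target sentence under
    cosine similarity is the true translation (row i pairs with row i)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src_emb))).mean())
```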
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology [0.0]
We use Computational Construction Grammar to provide a replicable and falsifiable set of syntactic features.
We use global language mapping based on web-crawled and social media datasets to determine the selection of national varieties.
Results show that models for each language robustly predict the region-of-origin of held-out samples, performing better when using Construction Grammars.
arXiv Detail & Related papers (2021-04-03T03:40:21Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
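Approach (ii) above, removing language-specific means and variances, amounts to standardizing each language's embeddings separately; a minimal numpy sketch under that reading:
```python
import numpy as np

def standardize_by_language(embeddings):
    """Remove each language's own mean and variance so that what
    remains is more language-agnostic. `embeddings` maps a language
    code to an (n_sentences, dim) array."""
    return {lang: (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
            for lang, X in embeddings.items()}
```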
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)