Corpus Similarity Measures Remain Robust Across Diverse Languages
- URL: http://arxiv.org/abs/2206.04332v1
- Date: Thu, 9 Jun 2022 08:17:16 GMT
- Authors: Haipeng Li and Jonathan Dunn
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper experiments with frequency-based corpus similarity measures across
39 languages using a register prediction task. The goal is to quantify (i) the
distance between different corpora from the same language and (ii) the
homogeneity of individual corpora. Both of these goals are essential for
measuring how well corpus-based linguistic analysis generalizes from one
dataset to another. The problem is that previous work has focused on
Indo-European languages, raising the question of whether these measures are
able to provide robust generalizations across diverse languages. This paper
uses a register prediction task to evaluate competing measures across 39
languages: how well are they able to distinguish between corpora representing
different contexts of production? Each experiment compares three corpora from a
single language, with the same three digital registers shared across all
languages: social media, web pages, and Wikipedia. Results show that measures
of corpus similarity retain their validity across different language families,
writing systems, and types of morphology. Further, the measures remain robust
when evaluated on out-of-domain corpora, when applied to low-resource
languages, and when applied to different sets of registers. These findings are
significant given our need to make generalizations across the rapidly
increasing number of corpora available for analysis.
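The abstract does not name the specific frequency-based measures evaluated, but a common member of this family (in the spirit of Kilgarriff-style corpus comparison) rank-correlates the frequencies of the most common words across two corpora. The sketch below is an illustrative assumption, not the authors' implementation; the function names and the `top_n` cutoff are hypothetical.

```python
from collections import Counter
from math import sqrt

def _ranks(values):
    """0-based ranks, largest value first; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):          # spread the average rank over the tie group
            ranks[order[k]] = (i + j) / 2.0
        i = j + 1
    return ranks

def _pearson(xs, ys):
    """Pearson correlation; applied to ranks, this is Spearman's rho with ties."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:                 # degenerate case: all frequencies tied
        return 0.0
    return cov / sqrt(vx * vy)

def corpus_similarity(tokens_a, tokens_b, top_n=100):
    """Rank-correlate the frequencies of the top_n words of the combined corpus.
    Returns a value in [-1, 1]; higher means more similar corpora."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    vocab = [w for w, _ in (fa + fb).most_common(top_n)]
    return _pearson(_ranks([fa[w] for w in vocab]),
                    _ranks([fb[w] for w in vocab]))
```

On a register prediction task like the one in the paper, a measure of this kind would be expected to score two corpora from the same register (e.g. two web-page samples) higher than corpora from different registers (e.g. web pages vs. Wikipedia).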
Related papers
- Exploring Intra and Inter-language Consistency in Embeddings with ICA (2024-06-18)
  Independent Component Analysis (ICA) creates clearer semantic axes by identifying independent key features.
  Previous research has shown ICA's potential to reveal universal semantic axes across languages.
  The paper investigates the consistency of these semantic axes both within a single language and across multiple languages.
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments (2024-04-11)
  The study investigates a novel and counter-intuitive driver of cross-lingual generalisation: language imbalance.
  The existence of a predominant language during training boosts the performance of less frequent languages.
  When the analysis is extended to real languages, infrequent languages still benefit from frequent ones, but whether language imbalance causes cross-lingual generalisation in that setting remains inconclusive.
- Validating and Exploring Large Geographic Corpora (2024-03-13)
  Three methods are used to improve the quality of sub-corpora representing specific language-country pairs, such as New Zealand English.
  The evaluation shows that the validity of sub-corpora improves with each stage of cleaning, but that this improvement is unevenly distributed across languages and populations.
- Models and Datasets for Cross-Lingual Summarisation (2022-02-19)
  The paper presents a cross-lingual summarisation corpus with long documents in a source language paired with multi-sentence summaries in a target language.
  The corpus covers twelve language pairs and directions for four European languages: Czech, English, French, and German.
  Cross-lingual document-summary instances are derived from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space (2021-09-13)
  In cross-lingual language models, representations for many different languages live in the same space.
  The paper computes a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
  It examines a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples (2021-04-17)
  The paper presents AM2iCo (Adversarial and Multilingual Meaning in Context), a benchmark that faithfully assesses the ability of state-of-the-art (SotA) representation models to understand word meaning in cross-lingual contexts.
  Results reveal that current SotA pretrained encoders substantially lag behind human performance.
- Representations of Language Varieties Are Reliable Given Corpus Similarity Measures (2021-04-03)
  The paper measures similarity both within and between 84 language varieties across nine languages.
  It shows consistent agreement between these sources using frequency-based corpus similarity measures.
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale (2020-10-26)
  A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation, partly because they are easy to use and partly because researchers expect to see them and know how to interpret them.
  The paper urges the community to consider more carefully how models are evaluated automatically.
- Fine-Grained Analysis of Cross-Linguistic Syntactic Divergences (2020-05-07)
  The paper proposes a framework for extracting divergence patterns for any language pair from a parallel corpus.
  The framework provides a detailed picture of cross-language divergences, generalises previous approaches, and lends itself to full automation.
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations (2020-04-30)
  Singular vector canonical correlation analysis is used to study what kind of information is induced from each source.
  The resulting representations embed typology and strengthen correlations with language relationships.
  The multi-view language vector space is then applied to multilingual machine translation, achieving competitive overall translation accuracy.
- Mapping Languages: The Corpus of Global Language Use (2020-04-02)
  The paper describes a web-based corpus of global language use, with a focus on how this corpus can be used for data-driven language mapping.
  In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
This list is automatically generated from the titles and abstracts of the papers in this site.