Representations of Language Varieties Are Reliable Given Corpus
Similarity Measures
- URL: http://arxiv.org/abs/2104.01294v1
- Date: Sat, 3 Apr 2021 02:19:46 GMT
- Title: Representations of Language Varieties Are Reliable Given Corpus
Similarity Measures
- Authors: Jonathan Dunn
- Abstract summary: This paper measures similarity both within and between 84 language varieties across nine languages.
The paper shows consistent agreement between these sources using frequency-based corpus similarity measures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper measures similarity both within and between 84 language varieties
across nine languages. These corpora are drawn from digital sources (the web
and tweets), allowing us to evaluate whether such geo-referenced corpora are
reliable for modelling linguistic variation. The basic idea is that, if each
source adequately represents a single underlying language variety, then the
similarity between these sources should be stable across all languages and
countries. The paper shows consistent agreement between these sources using
frequency-based corpus similarity measures. This provides further
evidence that digital geo-referenced corpora consistently represent local
language varieties.
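As an illustration of the kind of measure the paper relies on, here is a minimal sketch of a frequency-based corpus similarity computation, assuming Spearman correlation over shared top-word frequencies; the paper's exact feature set (e.g., words vs. character n-grams) and vocabulary size may differ.

```python
# Minimal sketch of a frequency-based corpus similarity measure:
# rank shared high-frequency words in each corpus and correlate the
# ranks. The feature choice (words vs. character n-grams) and the
# vocabulary size are illustrative assumptions, not the paper's
# exact settings.
from collections import Counter
from scipy.stats import spearmanr

def corpus_similarity(corpus_a, corpus_b, n_features=1000):
    """Spearman correlation between word-frequency ranks of two corpora.

    corpus_a, corpus_b: iterables of tokenized documents (lists of words).
    Returns a value in [-1, 1]; higher means more similar corpora.
    """
    freq_a = Counter(w for doc in corpus_a for w in doc)
    freq_b = Counter(w for doc in corpus_b for w in doc)
    # Shared feature space: the most frequent words of both corpora combined.
    features = [w for w, _ in (freq_a + freq_b).most_common(n_features)]
    counts_a = [freq_a[w] for w in features]
    counts_b = [freq_b[w] for w in features]
    rho, _ = spearmanr(counts_a, counts_b)
    return rho

# Toy usage: two tiny "corpora" of tokenized documents.
web = [["the", "the", "cat", "sat"], ["the", "dog"]]
tweets = [["the", "cat", "cat", "ran"]]
print(corpus_similarity(web, tweets, n_features=5))
```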
Related papers
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model, "conceptual coverage" of a given target language relative to a source language is assessed by comparing the population of images generated for a series of tangible nouns in the source language with the population generated for each noun's translation in the target language.
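As a rough sketch of that comparison, the snippet below scores one noun from precomputed image embeddings (e.g., CLIP vectors for each generated image); the mean-pairwise-cosine scoring and the `coverage_score` helper are illustrative assumptions, not the exact CoCo-CroLa metric.

```python
# Illustrative sketch: score how well images generated for a translated
# noun match images generated for the source-language noun, given
# precomputed image embeddings. The mean-pairwise-cosine scoring is an
# assumption, not the exact CoCo-CroLa definition.
import numpy as np

def coverage_score(src_embeds, tgt_embeds):
    """Mean pairwise cosine similarity between two image populations.

    src_embeds, tgt_embeds: arrays of shape (n_images, dim).
    """
    a = src_embeds / np.linalg.norm(src_embeds, axis=1, keepdims=True)
    b = tgt_embeds / np.linalg.norm(tgt_embeds, axis=1, keepdims=True)
    return float(np.mean(a @ b.T))

# Toy usage with random embeddings standing in for generated images.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 512))   # images for a noun in the source language
tgt = rng.normal(size=(8, 512))   # images for its translation
print(coverage_score(src, tgt))
```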
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity [5.439505575097552]
Cross-lingual semantic similarity models often rely on machine translation due to the unavailability of cross-lingual semantic similarity datasets.
For Persian, a low-resource language, a model that can understand context across the two languages is needed more than ever.
In this article, a corpus of semantic similarity between Persian and English sentences is produced for the first time with the help of linguistic experts.
arXiv Detail & Related papers (2023-05-13T11:02:50Z) - The Geometry of Multilingual Language Models: An Equality Lens [2.6746119935689214]
We analyze the geometry of three multilingual language models in Euclidean space.
Using a geometric separability index, we find that although languages tend to lie closer to members of their own linguistic family, they are almost separable from languages of other families.
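For intuition, here is a minimal sketch of one common geometric separability index, the fraction of points whose nearest neighbor shares their label; whether this matches the paper's exact index is an assumption.

```python
# Sketch of a geometric separability index: the fraction of language
# vectors whose nearest neighbor (in Euclidean distance) carries the
# same label, e.g. the same linguistic family. Whether this matches
# the paper's exact index is an assumption.
import numpy as np

def separability_index(X, labels):
    """X: (n, dim) language embeddings; labels: length-n family labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)      # exclude self-matches
    nearest = dists.argmin(axis=1)       # index of each point's neighbor
    labels = np.asarray(labels)
    return float(np.mean(labels[nearest] == labels))

# Toy usage: two well-separated "families" in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(separability_index(X, ["romance", "romance", "germanic", "germanic"]))
```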
arXiv Detail & Related papers (2023-05-13T05:19:15Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Corpus Similarity Measures Remain Robust Across Diverse Languages [0.0]
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.
The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora.
Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology.
arXiv Detail & Related papers (2022-06-09T08:17:16Z) - Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose corpus transfer as a novel way to establish a link between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z) - Bilingual Topic Models for Comparable Corpora [9.509416095106491]
We propose a binding mechanism between the distributions of the paired documents.
To estimate the similarity of documents written in different languages, we use cross-lingual word embeddings learned with shallow neural networks, as sketched below.
We evaluate the proposed binding mechanism by extending two topic models: a bilingual adaptation of LDA that assumes bag-of-words inputs and a model that incorporates part of the text structure in the form of boundaries of semantically coherent segments.
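A minimal sketch of that similarity step, assuming documents are represented as averaged cross-lingual word vectors (the toy embeddings below stand in for vectors learned by the shallow networks):

```python
# Sketch of the document-similarity step: represent each document as
# the average of its cross-lingual word embeddings and compare with
# cosine similarity. The averaging and the toy vectors are illustrative
# assumptions; real embeddings would be learned from data.
import numpy as np

def doc_similarity(doc_a, doc_b, embeddings):
    """Cosine similarity between mean word vectors of two documents.

    doc_a, doc_b: lists of tokens (possibly in different languages);
    each document must contain at least one token in `embeddings`.
    embeddings: dict mapping token -> vector in a shared space.
    """
    va = np.mean([embeddings[w] for w in doc_a if w in embeddings], axis=0)
    vb = np.mean([embeddings[w] for w in doc_b if w in embeddings], axis=0)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Toy shared space where translation pairs have similar vectors.
emb = {"dog": np.array([1.0, 0.0]), "chien": np.array([0.9, 0.1]),
       "cat": np.array([0.0, 1.0]), "chat": np.array([0.1, 0.9])}
print(doc_similarity(["dog", "cat"], ["chien", "chat"], emb))
```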
arXiv Detail & Related papers (2021-11-30T10:53:41Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
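As a hedged sketch of the grouping step, the snippet below clusters per-language representation vectors with k-means; the cluster count and the random stand-in vectors are assumptions.

```python
# Sketch of the grouping step: cluster per-language representation
# vectors (e.g., averaged hidden states from a multilingual model)
# with k-means. The cluster count and the random stand-in vectors
# are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["en", "de", "fr", "es", "zh", "ja", "hi", "ur"]
reps = rng.normal(size=(len(languages), 16))  # stand-in language vectors

groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reps)
for lang, g in zip(languages, groups):
    print(f"{lang}: representation sprachbund {g}")
```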
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
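For reference, a compact numpy sketch of singular vector CCA under simplifying assumptions (toy data, fixed component count k): reduce each view by SVD, then read the canonical correlations off the singular values of the product of the two orthonormal bases.

```python
# Sketch of singular vector CCA (SVCCA): reduce each view of the
# language vectors with SVD, then compute canonical correlations as
# the singular values of the product of the two orthonormal bases.
# Dimensions and toy data are assumptions for illustration.
import numpy as np

def svcca(X, Y, k=5):
    """X, Y: (n_languages, dim) views from two sources; returns the
    canonical correlations between their top-k SVD subspaces."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    # Columns of Ux/Uy are orthonormal; the singular values of their
    # cross-product are the canonical correlations.
    return np.linalg.svd(Ux[:, :k].T @ Uy[:, :k], compute_uv=False)

rng = np.random.default_rng(0)
base = rng.normal(size=(40, 8))                # shared structure
X = base @ rng.normal(size=(8, 20))            # view 1
Y = base @ rng.normal(size=(8, 20)) + 0.1 * rng.normal(size=(40, 20))
print(svcca(X, Y, k=5))                        # values near 1 = aligned views
```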
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.