Around the world in 60 words: A generative vocabulary test for online research
- URL: http://arxiv.org/abs/2302.01614v1
- Date: Fri, 3 Feb 2023 09:27:12 GMT
- Title: Around the world in 60 words: A generative vocabulary test for online research
- Authors: Pol van Rijn, Yue Sun, Harin Lee, Raja Marjieh, Ilia Sucholutsky,
Francesca Lanzarini, Elisabeth André, Nori Jacoby
- Abstract summary: We present an automated pipeline to generate vocabulary tests using text from Wikipedia.
Our pipeline samples rare nouns and creates pseudowords with the same low-level statistics.
Our test, available in eight languages, can easily be extended to other languages.
- Score: 12.91296932597502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conducting experiments with diverse participants in their native languages
can uncover insights into culture, cognition, and language that may not be
revealed otherwise. However, conducting these experiments online makes it
difficult to validate self-reported language proficiency. Furthermore, existing
proficiency tests are small and cover only a few languages. We present an
automated pipeline to generate vocabulary tests using text from Wikipedia. Our
pipeline samples rare nouns and creates pseudowords with the same low-level
statistics. Six behavioral experiments (N=236) in six countries and eight
languages show that (a) our test can distinguish between native speakers of
closely related languages, (b) the test is reliable ($r=0.82$), and (c)
performance strongly correlates with existing tests (LexTale) and self-reports.
We further show that test accuracy is negatively correlated with the linguistic
distance between the tested and the native language. Our test, available in
eight languages, can easily be extended to other languages.
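This listing contains no code; as a rough, hypothetical sketch of the kind of pipeline the abstract describes (sample rare nouns, then generate pseudowords sharing their low-level statistics), the following Python fits a character-bigram model to the sampled nouns and walks it to produce pseudowords. The file name, frequency band, and all function names are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import Counter, defaultdict

def rare_nouns(freq_list, low=5, high=50):
    """Keep words whose corpus frequency falls in a 'rare' band.
    freq_list: iterable of (word, count) pairs, e.g. from a Wikipedia dump."""
    return [w for w, c in freq_list if low <= c <= high]

def fit_bigram_model(words):
    """Character-bigram transition counts, with '^' and '$' as word boundaries."""
    transitions = defaultdict(Counter)
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            transitions[a][b] += 1
    return transitions

def sample_pseudoword(transitions, rng, max_len=12):
    """Walk the bigram chain from '^' until '$' or max_len characters."""
    out, current = [], "^"
    while len(out) < max_len:
        successors = transitions[current]
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        if nxt == "$":
            break
        out.append(nxt)
        current = nxt
    return "".join(out)

# Hypothetical usage: "freq.tsv" holds "word<TAB>count" lines extracted
# from a Wikipedia dump (file name and format are assumptions).
rng = random.Random(0)
with open("freq.tsv", encoding="utf-8") as f:
    freq = [(w, int(c)) for w, c in (line.rstrip("\n").split("\t") for line in f)]
nouns = rare_nouns(freq)
model = fit_bigram_model(nouns)
# Discard any generated string that happens to be a real sampled noun.
pseudowords = {sample_pseudoword(model, rng) for _ in range(200)} - set(nouns)
```

Because the pseudowords are sampled from the same character statistics as the real rare nouns, test takers cannot distinguish them by surface form alone, which is the point of such a vocabulary test.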
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate a novel and unintuitive driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation in that setting remains inconclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z)
- Matching Tweets With Applicable Fact-Checks Across Languages [27.762055254009017]
We focus on automatically finding existing fact-checks for claims made in social media posts (tweets).
We conduct both classification and retrieval experiments, in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings.
We present promising results for "match" classification (93% average accuracy) in four language pairs.
arXiv Detail & Related papers (2022-02-14T23:33:02Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance (a minimal sketch follows this entry).
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
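As a loose illustration of the bitext-retrieval alignment measure named in the entry above: embed parallel sentences in two languages and score how often the nearest target embedding of each source sentence is its true translation. The encoder is left abstract; this is a sketch under those assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest target (by cosine
    similarity) is the true translation, i.e. the same row index.
    src_emb, tgt_emb: (n, d) arrays embedding n parallel sentences."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T  # (n, n) matrix of cosine similarities
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(src))))

# Hypothetical usage with a stand-in encoder (encode() is assumed,
# e.g. any multilingual sentence encoder):
# src_emb = encode(english_sentences)
# tgt_emb = encode(german_sentences)
# print(retrieval_accuracy(src_emb, tgt_emb))
```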
- MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment [0.0]
MultiAzterTest is an open-source NLP tool that analyzes texts on over 125 measures of cohesion, language, and readability for English, Spanish, and Basque.
Using cross-lingual features, MultiAzterTest also obtains competitive results, above all in distinguishing complex from simple texts.
arXiv Detail & Related papers (2021-09-10T13:34:52Z)
- Improving Multilingual Models with Language-Clustered Vocabularies [8.587129426070979]
We introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters.
Our experiments show improvements across languages on key multilingual benchmark tasks.
arXiv Detail & Related papers (2020-10-24T04:49:15Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, standard UNMT can translate between only a single language pair and cannot produce translations for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information (see the sketch after this entry).
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
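To unpack the information-theoretic framing in the entry above: probing is cast as estimating the mutual information between representations $R$ and linguistic targets $T$, which a trained probe lower-bounds through its cross-entropy loss. The decomposition below is standard information theory, not quoted from the paper:

```latex
% Probing as mutual-information estimation: a probe q(t \mid r)
% approximates p(t \mid r), and its cross-entropy upper-bounds
% H(T \mid R), yielding a lower bound on I(R; T).
I(R; T) = H(T) - H(T \mid R)
        \geq H(T) - \mathbb{E}_{(r,t) \sim p}\!\left[ -\log q(t \mid r) \right]
```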
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.