Related papers: Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

URL: http://arxiv.org/abs/2306.02646v1
Date: Mon, 5 Jun 2023 07:32:21 GMT
Title: Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness
Authors: Yiyi Chen, Johannes Bjerva
Abstract summary: Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. We showcase curation procedures which result in a dataset covering 142 languages across 21 language families across the world. The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features.
Score: 6.790979602996742
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. By studying cross-lingual colexifications, researchers have gained valuable insights into fields such as psycholinguistics and cognitive sciences [Jackson et al., 2019]. While several multilingual colexification datasets exist, there is untapped potential in using this information to bootstrap datasets across such semantic features. In this paper, we aim to demonstrate how colexifications can be leveraged to create such cross-lingual datasets. We showcase curation procedures which result in a dataset covering 142 languages across 21 language families across the world. The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features. We further analyze the dataset along different dimensions to demonstrate the potential of the proposed procedures in facilitating further interdisciplinary research in psychology, cognitive science, and multilingual natural language processing (NLP). Based on initial investigations, we observe that i) colexifications that are closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/last phonemes are significantly correlated with concreteness/affectiveness within language families, such as /k/ as the initial phoneme in both Turkic and Tai-Kadai correlated with concreteness, and /p/ in Dravidian and Sino-Tibetan correlated with valence; iii) the type-to-token ratio (TTR) of phonemes is positively correlated with concreteness across several language families, while the length of phoneme segments is negatively correlated with concreteness; iv) certain phonological features are negatively correlated with concreteness across languages. The dataset is made public online for further research.
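
Illustrative note: the abstract refers to the type-to-token ratio (TTR) of phonemes and to the length of phoneme segments. A minimal Python sketch of those two quantities follows, assuming a word form has already been segmented into a list of phoneme symbols. The function name `phoneme_ttr` and the example forms are hypothetical and are not taken from the paper or its dataset; the authors' actual procedure may differ.

```python
def phoneme_ttr(phonemes):
    """Type-to-token ratio: distinct phonemes divided by total phonemes.

    `phonemes` is a list of phoneme symbols for one word form.
    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not phonemes:
        return 0.0
    return len(set(phonemes)) / len(phonemes)


# Hypothetical segmented forms, used only for illustration.
examples = {
    "kata": ["k", "a", "t", "a"],
    "papa": ["p", "a", "p", "a"],
}

for form, segments in examples.items():
    # Print the form, its segment length, and its phoneme TTR.
    print(form, len(segments), phoneme_ttr(segments))
```

Under these assumptions, a form that repeats fewer phonemes gets a higher TTR, while segment length is simply the number of phoneme tokens.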

Related papers

Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent [22.63155507847401]
We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don't. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves.
arXiv Detail & Related papers (2025-05-12T15:25:17Z)
ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships [0.0]
Natural Language Inference (NLI) serves as a crucial area within the domain of Natural Language Processing (NLP). This paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal relationships. The findings signify that the enrichment of genres essentially contributes to the enrichment of the model's capability to generalize.
arXiv Detail & Related papers (2025-03-11T18:32:16Z)
Event Extraction in Basque: Typologically motivated Cross-Lingual Transfer-Learning Analysis [18.25948580496853]
Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages. This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer.
arXiv Detail & Related papers (2024-04-09T15:35:41Z)
Exploring language relations through syntactic distances and geographic proximity [0.4369550829556578]
We explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. We find definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies.
arXiv Detail & Related papers (2024-03-27T10:36:17Z)
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation. To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda. We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages. Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs. Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.