Colexifications for Bootstrapping Cross-lingual Datasets: The Case of
Phonology, Concreteness, and Affectiveness
- URL: http://arxiv.org/abs/2306.02646v1
- Date: Mon, 5 Jun 2023 07:32:21 GMT
- Title: Colexifications for Bootstrapping Cross-lingual Datasets: The Case of
Phonology, Concreteness, and Affectiveness
- Authors: Yiyi Chen, Johannes Bjerva
- Abstract summary: Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings.
We showcase curation procedures which result in a dataset covering 142 languages from 21 language families across the world.
The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features.
- Score: 6.790979602996742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Colexification refers to the linguistic phenomenon where a single lexical
form is used to convey multiple meanings. By studying cross-lingual
colexifications, researchers have gained valuable insights into fields such as
psycholinguistics and cognitive sciences [Jackson et al., 2019]. While several
multilingual colexification datasets exist, there is untapped potential in
using this information to bootstrap datasets across semantic features such as
concreteness and affectiveness. In
this paper, we aim to demonstrate how colexifications can be leveraged to
create such cross-lingual datasets. We showcase curation procedures which
result in a dataset covering 142 languages from 21 language families across
the world. The dataset includes ratings of concreteness and affectiveness,
mapped with phonemes and phonological features. We further analyze the dataset
along different dimensions to demonstrate the potential of the proposed procedures
in facilitating further interdisciplinary research in psychology, cognitive
science, and multilingual natural language processing (NLP). Based on initial
investigations, we observe that i) concepts that are closer in
concreteness/affectiveness are more likely to colexify; ii) certain
initial/final phonemes are significantly correlated with
concreteness/affectiveness within language families, such as /k/ as the initial
phoneme in both Turkic and Tai-Kadai correlated with concreteness, and /p/ in
Dravidian and Sino-Tibetan correlated with valence; iii) the type-to-token
ratio (TTR) of phonemes is positively correlated with concreteness across
several language families, while the length of phoneme segments is negatively
correlated with concreteness; iv) certain phonological features are negatively
correlated with concreteness across languages. The dataset is made public
online for further research.
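Finding iii) can be made concrete with a minimal sketch, assuming that a word's phoneme TTR is the number of distinct phonemes divided by the total number of phoneme segments, and that a plain Pearson coefficient stands in for the correlation measure (the abstract does not say which statistic the authors use). The function names and the toy entries below are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def phoneme_ttr(segments):
    """Type-to-token ratio: distinct phonemes divided by total phonemes."""
    if not segments:
        raise ValueError("segments must be non-empty")
    return len(set(segments)) / len(segments)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented placeholder entries (phoneme segments, concreteness rating).
# These are illustrative only and do NOT come from the published dataset.
TOY_ENTRIES = [
    (["k", "a", "t"], 4.7),
    (["p", "a", "p", "a"], 4.1),
    (["s", "i", "s", "i", "s", "i"], 2.0),
    (["m", "a", "n", "a", "m", "a", "n", "a"], 1.5),
]

ttrs = [phoneme_ttr(segments) for segments, _ in TOY_ENTRIES]
lengths = [len(segments) for segments, _ in TOY_ENTRIES]
ratings = [rating for _, rating in TOY_ENTRIES]

print("TTR vs. concreteness:", round(pearson(ttrs, ratings), 3))
print("length vs. concreteness:", round(pearson(lengths, ratings), 3))
```

On real data one would compute these per language family and test for significance; a rank-based measure such as Spearman's rho may be a better fit for ordinal ratings.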
Related papers
- ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships [0.0]
Natural Language Inference (NLI) serves as a crucial area within the domain of Natural Language Processing (NLP).
This paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal relationships.
The findings indicate that a richer set of genres improves the model's ability to generalize.
arXiv Detail & Related papers (2025-03-11T18:32:16Z)
- Event Extraction in Basque: Typologically motivated Cross-Lingual Transfer-Learning Analysis [18.25948580496853]
Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages.
This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer.
arXiv Detail & Related papers (2024-04-09T15:35:41Z)
- Exploring language relations through syntactic distances and geographic proximity [0.4369550829556578]
We explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset.
We find definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies.
arXiv Detail & Related papers (2024-03-27T10:36:17Z)
- SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.