Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis
- URL: http://arxiv.org/abs/2501.06810v1
- Date: Sun, 12 Jan 2025 13:29:24 GMT
- Title: Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis
- Authors: Minu Kim, Kangwook Jang, Hoirin Kim,
- Abstract summary: This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages.
Using phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training.
- Score: 7.751856268560216
- License:
- Abstract: This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.
Related papers
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z) - A Computational Model for the Assessment of Mutual Intelligibility Among
Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z) - GradSim: Gradient-Based Language Grouping for Effective Multilingual
Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
arXiv Detail & Related papers (2023-10-23T18:13:37Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities
on Multilingual Speech Recognition [31.575930914290762]
A novel data-driven approach is proposed to investigate the cross-lingual acoustic-phonetic similarities.
Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form.
A relative improvement of 8% over monolingual counterpart is achieved.
arXiv Detail & Related papers (2022-07-07T15:55:41Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Automatically Identifying Language Family from Acoustic Examples in Low
Resource Scenarios [48.57072884674938]
We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
arXiv Detail & Related papers (2020-12-01T22:44:42Z) - Rediscovering the Slavic Continuum in Representations Emerging from
Neural Models of Spoken Language Identification [16.369477141866405]
We present a neural model for Slavic language identification in speech signals.
We analyze its emergent representations to investigate whether they reflect objective measures of language relatedness.
arXiv Detail & Related papers (2020-10-22T18:18:19Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.