Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer
- URL: http://arxiv.org/abs/2402.14279v3
- Date: Fri, 15 Nov 2024 17:11:08 GMT
- Title: Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer
- Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen
- Abstract summary: Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages.
We present experiments on three representative cross-lingual tasks across 12 languages in total.
Phonemic representations exhibit higher similarity between languages than orthographic representations.
- Score: 26.014079273740485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align languages in a single latent space to mitigate such gaps, how different input-level representations influence these gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between languages, and revisit phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks across 12 languages in total. The results show that phonemic representations exhibit higher similarity between languages than orthographic representations, and consistently outperform the grapheme-based baseline model on relatively low-resource languages. We present quantitative evidence from three cross-lingual tasks demonstrating the effectiveness of phonemic representations, further supported by a theoretical analysis of the cross-lingual performance gap.
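As a rough illustration of why phonemic input can shrink cross-lingual surface differences, the sketch below converts cognate word pairs from two related languages to IPA and compares overlap in both forms. This is a minimal sketch assuming the open-source `epitran` grapheme-to-phoneme library and hand-picked word pairs; it is not the paper's experimental pipeline.

```python
# Minimal sketch: orthographic vs. phonemic surface similarity between two
# related languages. Assumes the `epitran` G2P library (pip install epitran);
# the word pairs and the bigram-Jaccard measure are illustrative only, not
# the paper's actual setup.
import epitran


def char_bigrams(s: str) -> set:
    """Character bigrams, a crude proxy for surface similarity."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


# Spanish / Italian cognate pairs (orthographic forms).
pairs = [("que", "che"), ("noche", "notte"), ("hecho", "fatto")]

epi_spa = epitran.Epitran("spa-Latn")
epi_ita = epitran.Epitran("ita-Latn")

for spa, ita in pairs:
    ortho = jaccard(char_bigrams(spa), char_bigrams(ita))
    phone = jaccard(
        char_bigrams(epi_spa.transliterate(spa)),
        char_bigrams(epi_ita.transliterate(ita)),
    )
    print(f"{spa} / {ita}: orthographic={ortho:.2f}, phonemic={phone:.2f}")
```

Pairs like "que" / "che" share no orthographic bigrams but map to near-identical IPA, which is the kind of discrepancy the phonemic input is meant to remove.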
Related papers
- High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs.
We explore 31 diverse languages varying on their resource-levels, typologies, and geographical regions.
We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z)
- Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis [7.751856268560216]
This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages.
Training with phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training.
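One simple way to operationalize "phonological similarity" is overlap between phoneme inventories. The snippet below is a self-contained sketch with tiny hand-written inventories (hypothetical; a real study would draw them from a database such as PHOIBLE), ranking candidate donor languages for a low-resource target.

```python
# Hypothetical, hand-written phoneme inventories; a real study would use a
# database such as PHOIBLE. Jaccard overlap ranks candidate donor languages.
inventories = {
    "target":  {"p", "t", "k", "m", "n", "s", "a", "i", "u"},
    "donor_a": {"p", "t", "k", "b", "d", "g", "m", "n", "s", "a", "i", "u"},
    "donor_b": {"t", "k", "q", "x", "m", "n", "a", "e", "o"},
}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


target = inventories["target"]
ranked = sorted(
    (lang for lang in inventories if lang != "target"),
    key=lambda lang: jaccard(target, inventories[lang]),
    reverse=True,
)
for lang in ranked:
    print(lang, round(jaccard(target, inventories[lang]), 2))
```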
arXiv Detail & Related papers (2025-01-12T13:29:24Z)
- Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space [6.6635650150737815]
This study examines the absolute evolution of the respective language representation spaces produced by multilingual language models (MLLMs).
We place a specific emphasis on the role of linguistic characteristics and investigate their inter-correlation with the impact on representation spaces and cross-lingual transfer performance.
arXiv Detail & Related papers (2023-05-03T14:33:23Z)
- Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation [8.718749742587857]
Cross-lingual summarization models rely on the self-attention mechanism to attend to tokens across two languages.
We propose a novel Knowledge-Distillation-based framework for Cross-Lingual Summarization.
Our method outperforms state-of-the-art models in both high- and low-resource settings.
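For intuition, the snippet below sketches the entropy-regularized optimal transport (Sinkhorn) distance between two token-probability histograms, the kind of quantity such a distillation loss builds on. It is a generic numpy illustration under assumed toy distributions and cost matrix, not the paper's implementation.

```python
import numpy as np


def sinkhorn_distance(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT distance between histograms a and b.

    a, b: nonnegative vectors summing to 1; cost: pairwise cost matrix.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):             # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((plan * cost).sum())


# Toy example: distance between hypothetical teacher and student
# distributions over 4 vocabulary items.
teacher = np.array([0.7, 0.1, 0.1, 0.1])
student = np.array([0.25, 0.25, 0.25, 0.25])
cost = 1.0 - np.eye(4)                   # 0 on the diagonal, 1 elsewhere
print(sinkhorn_distance(teacher, student, cost))
```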
arXiv Detail & Related papers (2021-12-07T03:45:02Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
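A minimal sketch of the grouping step, assuming languages are already embedded as vectors (random stand-ins here) and using scikit-learn's KMeans; each resulting cluster plays the role of a "representation sprachbund".

```python
# Sketch: cluster language representation vectors into groups
# ("representation sprachbunds"). The vectors are random stand-ins for
# embeddings extracted from a multilingual pre-trained model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["de", "nl", "sv", "es", "it", "pt", "hi", "ur", "bn"]
vectors = rng.normal(size=(len(languages), 16))  # placeholder embeddings

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for cluster_id in range(3):
    members = [l for l, c in zip(languages, kmeans.labels_) if c == cluster_id]
    print(f"sprachbund {cluster_id}: {members}")
```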
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study multilingual language models to understand their capability and adaptability in the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
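The summary does not specify the algorithm, so as a generic illustration the sketch below applies one well-known first-order meta-learning update (Reptile-style) to a toy linear model across per-language tasks; the paper's actual method may differ.

```python
# Generic first-order meta-learning (Reptile-style) sketch on toy
# per-language regression tasks; not the paper's actual algorithm.
import numpy as np

rng = np.random.default_rng(0)


def grad(w, X, y):
    """Gradient of mean squared error for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)


# Each task stands in for one language's training data.
tasks = []
for _ in range(5):
    X = rng.normal(size=(32, 4))
    w_true = rng.normal(size=4)
    tasks.append((X, X @ w_true + 0.1 * rng.normal(size=32)))

w = np.zeros(4)
inner_lr, outer_lr = 0.05, 0.5
for step in range(200):
    X, y = tasks[step % len(tasks)]
    w_task = w.copy()
    for _ in range(3):                    # inner-loop adaptation on one task
        w_task -= inner_lr * grad(w_task, X, y)
    w += outer_lr * (w_task - w)          # meta-update toward adapted weights
```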
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
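A minimal PyTorch sketch of the general idea, assuming paired sentence encodings from two languages: add a penalty that pulls the paired representations together on top of the task loss. The pooling, pairing, and weighting choices here are illustrative, not the paper's.

```python
# Sketch: regularize cross-lingual representation alignment by penalizing
# the distance between paired sentence encodings. The encodings, task loss,
# and weighting factor are illustrative placeholders.
import torch


def alignment_penalty(h_src: torch.Tensor, h_tgt: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between paired sentence representations."""
    return ((h_src - h_tgt) ** 2).mean()


# Toy paired batch: mean-pooled encodings of translated sentence pairs.
h_en = torch.randn(8, 128)   # source-language sentence representations
h_th = torch.randn(8, 128)   # target-language sentence representations

task_loss = torch.tensor(0.42)                      # placeholder task loss
loss = task_loss + 0.1 * alignment_penalty(h_en, h_th)
print(loss.item())
```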
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Learning Robust and Multilingual Speech Representations [38.34632996576116]
We learn representations from up to 8000 hours of diverse and noisy speech data.
We evaluate the representations by looking at their robustness to domain shifts and their ability to improve recognition performance in many languages.
arXiv Detail & Related papers (2020-01-29T23:24:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.