Related papers: Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

URL: http://arxiv.org/abs/2402.14279v1
Date: Thu, 22 Feb 2024 04:41:52 GMT
Title: Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding
Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen
Abstract summary: Performance gaps between languages are affected by linguistic gaps between those languages. We present evidence from three cross-lingual tasks that demonstrate the effectiveness of phonemic representation.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provide a novel solution for robust multilingual language modeling by employing phonemic representations (specifically, using phonemes as input tokens to LMs rather than subwords). We present quantitative evidence from three cross-lingual tasks that demonstrate the effectiveness of phonemic representation, which is further justified by a theoretical analysis of the cross-lingual performance gap.

Related papers

High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs. We explore 31 diverse languages varying on their resource-levels, typologies, and geographical regions. We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z)
Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis [7.751856268560216]
This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages. Using phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training.
arXiv Detail & Related papers (2025-01-12T13:29:24Z)
Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space [6.6635650150737815]
This study examines the absolute evolution of the respective language representation spaces produced by MLLMs. We place a specific emphasis on the role of linguistic characteristics and investigate their inter-correlation with the impact on representation spaces and cross-lingual transfer performance.
arXiv Detail & Related papers (2023-05-03T14:33:23Z)
Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation [8.718749742587857]
Cross-lingual summarization models rely on the self-attention mechanism to attend among tokens in two languages. We propose a novel Knowledge-Distillation-based framework for Cross-Lingual Summarization. Our method outperforms state-of-the-art models under both high and low-resourced settings.
arXiv Detail & Related papers (2021-12-07T03:45:02Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting. Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages. We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource. Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
Learning Robust and Multilingual Speech Representations [38.34632996576116]
We learn representations from up to 8000 hours of diverse and noisy speech data. We evaluate the representations by looking at their robustness to domain shifts and their ability to improve recognition performance in many languages.
arXiv Detail & Related papers (2020-01-29T23:24:56Z)
