Overlap-based Vocabulary Generation Improves Cross-lingual Transfer
Among Related Languages
- URL: http://arxiv.org/abs/2203.01976v1
- Date: Thu, 3 Mar 2022 19:35:24 GMT
- Title: Overlap-based Vocabulary Generation Improves Cross-lingual Transfer
Among Related Languages
- Authors: Vaidehi Patil, Partha Talukdar, Sunita Sarawagi
- Abstract summary: We argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs.
We propose Overlap BPE, a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages.
- Score: 18.862296065737347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained multilingual language models such as mBERT and XLM-R have
demonstrated great potential for zero-shot cross-lingual transfer to low
web-resource languages (LRL). However, due to limited model capacity, the large
difference in the sizes of available monolingual corpora between high
web-resource languages (HRL) and LRLs does not provide enough scope of
co-embedding the LRL with the HRL, thereby affecting downstream task
performance of LRLs. In this paper, we argue that relatedness among languages
in a language family along the dimension of lexical overlap may be leveraged to
overcome some of the corpora limitations of LRLs. We propose Overlap BPE
(OBPE), a simple yet effective modification to the BPE vocabulary generation
algorithm which enhances overlap across related languages. Through extensive
experiments on multiple NLP tasks and datasets, we observe that OBPE generates
a vocabulary that increases the representation of LRLs via tokens shared with
HRLs. This results in improved zero-shot transfer from related HRLs to LRLs
without reducing HRL representation and accuracy. Unlike previous studies that
dismissed the importance of token-overlap, we show that in the low-resource
related language setting, token overlap matters. Synthetically reducing the
overlap to zero can cause as much as a four-fold drop in zero-shot transfer
accuracy.
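As a rough illustration of the idea, the sketch below biases standard BPE merge selection toward symbol pairs that occur in several of the related languages. The scoring rule used here (total pair frequency scaled by an `alpha`-weighted overlap bonus) is an illustrative assumption, not the exact OBPE objective from the paper.

```python
# Overlap-aware BPE sketch: merges are chosen by frequency plus a bonus for
# pairs shared across related languages. The scoring rule is illustrative,
# not the paper's exact OBPE objective.
from collections import Counter


def get_pair_counts(corpus):
    """corpus: {word (tuple of symbols): frequency} for one language."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Apply one merge to every word in a language's corpus."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def overlap_bpe(corpora, num_merges, alpha=0.5):
    """corpora: {lang: {word (tuple of chars): freq}}. Returns chosen merges."""
    merges = []
    for _ in range(num_merges):
        per_lang = {lang: get_pair_counts(c) for lang, c in corpora.items()}
        candidates = set().union(*[p.keys() for p in per_lang.values()])
        if not candidates:
            break

        def score(pair):
            total = sum(p[pair] for p in per_lang.values())
            coverage = sum(1 for p in per_lang.values() if p[pair] > 0)
            # Frequency term times an overlap bonus for pairs shared by
            # several related languages.
            return total * (1 + alpha * (coverage - 1) / max(len(corpora) - 1, 1))

        best = max(candidates, key=score)
        merges.append(best)
        corpora = {lang: merge_pair(c, best) for lang, c in corpora.items()}
    return merges


if __name__ == "__main__":
    hrl = {tuple("paani"): 50, tuple("khaana"): 30}  # toy HRL word counts
    lrl = {tuple("paani"): 5, tuple("khaane"): 3}    # toy related-LRL word counts
    print(overlap_bpe({"hrl": hrl, "lrl": lrl}, num_merges=5))
```

With `alpha = 0` the rule reduces to plain frequency-based joint BPE; raising `alpha` favors merges that both languages can reuse, which is the intuition behind enhancing overlap.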
Related papers
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and supervised fine-tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z)
- Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models [12.447489454369636]
This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings.
LLMs can achieve performance comparable or even better than previously proposed models, despite not being explicitly trained for any machine translation task.
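A minimal sketch of the embedding-similarity side of such a detector is shown below, assuming a LaBSE-style multilingual sentence encoder loaded through the sentence-transformers library; the 0.5 decision threshold is an arbitrary placeholder rather than a value from the paper.

```python
# Flag a candidate translation as a possible hallucination when its multilingual
# sentence embedding is semantically distant from the source. Model choice and
# threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")


def flag_hallucination(source: str, translation: str, threshold: float = 0.5) -> bool:
    """Return True if source and translation embeddings are too dissimilar."""
    embs = model.encode([source, translation], normalize_embeddings=True)
    cosine = float(np.dot(embs[0], embs[1]))  # unit vectors, so this is cosine similarity
    return cosine < threshold


print(flag_hallucination("Das Wetter ist heute schön.",
                         "The weather is nice today."))  # expected: False
```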
arXiv Detail & Related papers (2024-07-23T13:40:54Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL [78.80673954827773]
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias.
We propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics.
We find notable potential: LLMs can indeed capture semantic structures, although scaling up does not always translate into corresponding gains.
Surprisingly, the errors made by LLMs and by untrained humans overlap significantly, accounting for almost 30% of all errors.
arXiv Detail & Related papers (2024-05-10T11:44:05Z)
- Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages [5.473562965178709]
We focus on 12 low-resource languages (LRLs) from Brazil, 2 LRLs from Africa, and 2 high-resource languages (HRLs).
Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs.
arXiv Detail & Related papers (2024-04-28T19:24:28Z)
- Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets [4.653113033432781]
Cross-lingual transfer capabilities of Multilingual Language Models (MLLMs) are investigated.
Our research provides valuable insights into cross-lingual transfer and its implications for NLP applications.
arXiv Detail & Related papers (2024-03-29T08:47:15Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose SALT, a simple yet effective method to improve zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-09-19T19:30:56Z)
- CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages [22.51558549091902]
We address machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a closely related high-resource language (HRL).
Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity.
Existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align HRL and ELRL latent embedding space.
We propose CharSpan, a novel approach based on 'character-span noise augmentation' of the HRL training data.
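As a rough illustration, the sketch below applies one possible form of character-span noise to HRL source sentences; the span length, per-word noise rate, and deletion-only policy are assumptions rather than the paper's exact recipe.

```python
# Character-span noise augmentation sketch: occasionally delete a short
# character span inside a word of the HRL training sentence. Parameters and
# the deletion-only policy are illustrative assumptions.
import random


def char_span_noise(sentence: str, max_span: int = 3, p: float = 0.1) -> str:
    """With probability p per word, delete a random span of up to max_span chars."""
    noisy_words = []
    for word in sentence.split():
        if len(word) > max_span and random.random() < p:
            start = random.randrange(0, len(word) - max_span + 1)
            length = random.randint(1, max_span)
            word = word[:start] + word[start + length:]
        noisy_words.append(word)
    return " ".join(noisy_words)


random.seed(0)
print(char_span_noise("aap kaise hain", p=1.0))  # prints a noised variant of the input
```

The intent is that training on such noised HRL text makes the model less sensitive to the small surface-level lexical differences between the HRL and a closely related ELRL.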
arXiv Detail & Related papers (2023-05-09T07:23:01Z)
- Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study [14.34516262614775]
We argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs.
We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script) and (2) sentence structure.
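As a rough illustration of the script dimension, many Brahmic Unicode blocks are laid out in parallel, so a fixed codepoint offset already gives a crude transliteration between related scripts. The sketch below maps Bengali to Devanagari this way; it is a simplification for illustration, not the adaptation method used in the paper.

```python
# Crude Bengali -> Devanagari transliteration exploiting the parallel layout of
# Brahmic Unicode blocks (Devanagari from U+0900, Bengali from U+0980). A real
# pipeline would use a proper transliteration tool; this is only an illustration.
DEVANAGARI_START, BENGALI_START, BLOCK_SIZE = 0x0900, 0x0980, 0x80


def bengali_to_devanagari(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if BENGALI_START <= cp < BENGALI_START + BLOCK_SIZE:
            out.append(chr(cp - BENGALI_START + DEVANAGARI_START))
        else:
            out.append(ch)
    return "".join(out)


print(bengali_to_devanagari("ভাষা"))  # -> "भाषा", the parallel Devanagari codepoints
```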
arXiv Detail & Related papers (2021-06-07T20:43:02Z)
- Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
Multilingual data has been found to be more beneficial for NMT models that translate from an LRL into a target language than for those that translate into the LRL.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z)