Training a Bilingual Language Model by Mapping Tokens onto a Shared
Character Space
- URL: http://arxiv.org/abs/2402.16065v1
- Date: Sun, 25 Feb 2024 11:26:39 GMT
- Title: Training a Bilingual Language Model by Mapping Tokens onto a Shared
Character Space
- Authors: Aviad Rom and Kfir Bar
- Abstract summary: We train a bilingual Arabic-Hebrew language model using a transliterated version of Arabic texts in Hebrew.
We assess the performance of a language model that employs a unified script for both languages on machine translation.
- Score: 2.9914612342004503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We train a bilingual Arabic-Hebrew language model using a transliterated
version of Arabic texts in Hebrew, to ensure both languages are represented in
the same script. Given the morphological and structural similarities between Arabic and Hebrew,
and the extensive number of cognates the two languages share, we assess the
performance of a language model that employs a unified script for both
languages on machine translation, a task that requires cross-lingual knowledge. The
results are promising: our model outperforms a contrasting model which keeps
the Arabic texts in the Arabic script, demonstrating the efficacy of the
transliteration step. Despite being trained on a dataset approximately 60%
smaller than that of other existing language models, our model appears to
deliver comparable performance in machine translation across both translation
directions.
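A minimal sketch of the transliteration idea at the heart of the paper, assuming a simple one-to-one character table; the correspondences below are illustrative etymological pairings, not necessarily the authors' exact mapping:

```python
# Sketch of Arabic-to-Hebrew script transliteration with a one-to-one table.
# The mapping covers only part of the alphabet, follows common etymological
# correspondences, and ignores Hebrew final letter forms; the paper's actual
# table may differ.
AR_TO_HE = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה", "و": "ו",
    "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י", "ك": "כ", "ل": "ל",
    "م": "מ", "ن": "נ", "س": "ס", "ع": "ע", "ف": "פ", "ص": "צ",
    "ق": "ק", "ر": "ר", "ش": "ש", "ت": "ת",
}

def transliterate(arabic_text: str) -> str:
    """Map Arabic characters onto Hebrew ones; pass through anything unmapped."""
    return "".join(AR_TO_HE.get(ch, ch) for ch in arabic_text)

print(transliterate("سلام"))  # -> סלאמ
```

Once Arabic is rendered in Hebrew characters, a single subword vocabulary can be learned over the combined corpus, letting cognates share tokens across the two languages.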
Related papers
- Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks [17.5987429821102]
Swan is a family of embedding models centred around the Arabic language.
It comes in two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model.
arXiv Detail & Related papers (2024-11-02T09:39:49Z)
- ALLaM: Large Language Models for Arabic and English [9.881560166505452]
We present ALLaM: Arabic Large Language Model, a series of large language models built to support the ecosystem of Arabic Language Technologies (ALT).
Our autoregressive decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining can steer a model toward a new language (Arabic) without catastrophic forgetting of the original language (English).
We show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment.
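A generic sketch of the vocabulary-expansion step this recipe relies on, using Hugging Face Transformers with a stand-in base model; an illustration of the technique, not ALLaM's released training code:

```python
# Generic sketch of second-language vocabulary expansion: add target-language
# tokens to a tokenizer and resize the model's embedding matrix before
# continued pretraining on mixed-language data. Not ALLaM's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in English base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical Arabic subwords, e.g. learned from an Arabic corpus.
new_arabic_tokens = ["اللغة", "العربية", "نموذج"]
num_added = tokenizer.add_tokens(new_arabic_tokens)

# New rows are appended to the embedding matrix (randomly initialized);
# continued pretraining on an Arabic+English mix then trains them without
# erasing the English weights, mitigating catastrophic forgetting.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```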
arXiv Detail & Related papers (2024-07-22T05:35:17Z)
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins by expanding the vocabulary and training only the embedding matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
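The embeddings-only first stage can be sketched by freezing every other parameter; again a generic illustration with a stand-in model, not the authors' code:

```python
# Sketch of stage one of the two-stage recipe: freeze all transformer weights
# and train only the (newly expanded) embedding matrix.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base LLM

for param in model.parameters():
    param.requires_grad = False                       # freeze the whole network

# Unfreeze only the input embeddings (for GPT-2 the output head is tied to
# them, so it is unfrozen as well).
model.get_input_embeddings().weight.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")           # only the embedding rows
# Stage two then unfreezes everything for continual pretraining on the
# Arabic/English mix.
```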
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel few-shot prompting approach that decomposes the translation process into a sequence of word-chunk translations.
We show that DecoMT outperforms the strong few-shot prompted BLOOM model with an average improvement of 8 chrF++ points across the examined languages.
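A deliberately simplified sketch of the chunk-by-chunk structure, with a hypothetical `llm` function standing in for the few-shot prompted model; the paper's exact prompting scheme differs:

```python
# Rough sketch of decomposed prompting: translate a sentence as a sequence of
# word-chunk translations instead of one monolithic prompt.
def llm(prompt: str) -> str:
    # Stand-in for a few-shot prompted model such as BLOOM; a real call to
    # an actual LLM would go here.
    return "<chunk translation>"

FEW_SHOT_EXAMPLES = "..."  # demonstrations of chunk-level translation

def decomposed_translate(sentence: str, chunk_size: int = 3) -> str:
    words = sentence.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    translated = []
    for chunk in chunks:
        # Each chunk is translated given the previously translated context,
        # so chunk translations stay coherent with one another.
        prompt = (f"{FEW_SHOT_EXAMPLES}\n"
                  f"Context so far: {' '.join(translated)}\n"
                  f"Translate: {chunk}\nTranslation:")
        translated.append(llm(prompt))
    return " ".join(translated)

print(decomposed_translate("a short example sentence to translate"))
```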
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, and Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z)
- "Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks [20.837515947519524]
First sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data, from which bilingual dictionaries and cross-lingual word embeddings can be extracted to mine parallel text from Wikipedia.
For image captioning, we train a multi-task machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a "wikily" translated version of the English captioning data.
Our captioning results in Arabic are slightly better than those of the supervised baseline.
arXiv Detail & Related papers (2021-04-16T21:49:12Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases relative to their monolingual counterparts.
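One common proxy for how adequately a tokenizer's vocabulary represents a language is fertility, the mean number of subwords per word; a quick sketch comparing a multilingual and a monolingual tokenizer (the checkpoint names are real Hugging Face models used as examples, not necessarily the study's exact setup):

```python
# Sketch: compare how finely a multilingual vs. a monolingual tokenizer
# splits the same text. Lower fertility suggests a better-represented language.
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    words = text.split()
    return len(tok.tokenize(text)) / len(words)

text = "مثال قصير باللغة العربية"  # short Arabic sample
for name in ["bert-base-multilingual-cased", "aubmindlab/bert-base-arabertv2"]:
    print(name, fertility(name, text))
```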
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding [0.0]
We develop an Arabic language representation model, which we name AraELECTRA.
Our model is pretrained using the replaced token detection objective on large Arabic text corpora.
We show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data, even with a smaller model size.
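The replaced token detection objective can be sketched as per-token binary classification; a schematic of the loss, not AraELECTRA's actual pretraining code:

```python
# Schematic of ELECTRA-style replaced token detection (RTD): a small generator
# fills in masked positions, and the discriminator labels every token of the
# corrupted sequence as original (0) or replaced (1).
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor, corrupted_ids: torch.Tensor,
             original_ids: torch.Tensor) -> torch.Tensor:
    """disc_logits: (batch, seq) per-token scores from the discriminator."""
    is_replaced = (corrupted_ids != original_ids).float()  # per-token labels
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# Toy example: token 2 was swapped by the generator, so its label is 1.
original = torch.tensor([[5, 9, 13, 7]])
corrupted = torch.tensor([[5, 9, 42, 7]])
logits = torch.randn(1, 4)
print(rtd_loss(logits, corrupted, original))
```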
arXiv Detail & Related papers (2020-12-31T09:35:39Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
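A deliberately compressed sketch of the joint objective, collapsing the decoder into two linear heads so the translate-and-reconstruct loss is visible; the paper's model is a full LSTM encoder-decoder:

```python
# Compressed sketch of the dual objective: a shared LSTM encoder feeds one
# head that predicts the translation and one that reconstructs the input.
# Contextual embeddings are read off the shared encoder states. Sketch only.
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    def __init__(self, vocab_src: int, vocab_tgt: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_src, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.translate_head = nn.Linear(dim, vocab_tgt)    # target tokens
        self.reconstruct_head = nn.Linear(dim, vocab_src)  # source tokens

    def forward(self, src_ids):
        states, _ = self.encoder(self.embed(src_ids))      # contextual embeddings
        return self.translate_head(states), self.reconstruct_head(states), states

model = TranslateReconstruct(vocab_src=1000, vocab_tgt=1200)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1200, (2, 7))
trans_logits, recon_logits, ctx = model(src)
loss = (nn.functional.cross_entropy(trans_logits.transpose(1, 2), tgt)
        + nn.functional.cross_entropy(recon_logits.transpose(1, 2), src))
print(loss.item())
```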
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
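The amalgamation step can be sketched as multi-teacher knowledge distillation, with the student matching each language-branch teacher's softened outputs; a generic sketch of the idea:

```python
# Generic sketch of multi-teacher distillation for merging language-branch
# MRC models into one multilingual student: the student matches the softened
# output distributions (e.g. answer-span logits) of every teacher.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits_list, T: float = 2.0):
    """KL divergence between the student and each language-branch teacher."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    losses = [F.kl_div(log_p_student, F.softmax(t / T, dim=-1),
                       reduction="batchmean") * T * T
              for t in teacher_logits_list]
    return torch.stack(losses).mean()

student = torch.randn(4, 128)                        # e.g. span-start logits
teachers = [torch.randn(4, 128) for _ in range(3)]   # one teacher per branch
print(distill_loss(student, teachers))
```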
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
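A sketch of the contrastive pretext task in the InfoNCE style, treating translation pairs as positives against in-batch negatives; a generic formulation of the idea, not InfoXLM's exact loss:

```python
# InfoNCE-style sketch of cross-lingual contrastive learning: embeddings of a
# translation pair should be closer to each other than to other sentences in
# the batch.
import torch
import torch.nn.functional as F

def contrastive_loss(src_emb, tgt_emb, temperature: float = 0.05):
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(len(src))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

src = torch.randn(8, 768)  # e.g. English sentence embeddings
tgt = torch.randn(8, 768)  # embeddings of their translations
print(contrastive_loss(src, tgt))
```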
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.