Impact of Tokenization on LLaMa Russian Adaptation
- URL: http://arxiv.org/abs/2312.02598v1
- Date: Tue, 5 Dec 2023 09:16:03 GMT
- Title: Impact of Tokenization on LLaMa Russian Adaptation
- Authors: Mikhail Tikhomirov and Daniil Chernyshev
- Abstract summary: We investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation.
The results of automatic evaluation show that vocabulary substitution improves the model's quality in Russian.
Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The latest instruction-tuned large language models (LLMs) show great results on various tasks; however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data, which hinders the comprehension of non-English instructions and limits the potential of target-language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on the Russian SuperGLUE benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.
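As a rough, hedged illustration of what vocabulary substitution involves, the sketch below trains a Russian-oriented SentencePiece tokenizer, swaps it in for the original LLaMa tokenizer, and initializes each new embedding row from the old sub-token embeddings it replaces. The corpus path, base checkpoint, vocabulary size, and the mean-initialization heuristic are assumptions for illustration only; the paper itself compares three adaptation variants whose exact details are given there.

```python
# Hypothetical sketch of vocabulary substitution for Russian LLaMa adaptation.
# The corpus path, checkpoint name, vocabulary size, and the mean-initialization
# heuristic are illustrative assumptions, not the procedure evaluated in the paper.
import sentencepiece as spm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer

# 1. Train a Russian-oriented SentencePiece vocabulary on a monolingual corpus.
spm.SentencePieceTrainer.train(
    input="russian_corpus.txt",        # assumed corpus file
    model_prefix="ru_llama_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)

# 2. Load the base model and its original tokenizer.
base = "huggyllama/llama-7b"           # assumed base checkpoint
old_tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# 3. Wrap the new SentencePiece model as a LLaMa tokenizer.
new_tok = LlamaTokenizer(vocab_file="ru_llama_sp.model")

# 4. Substitute the vocabulary: resize the embedding matrix and initialize each
#    new token's row from the mean of the old sub-token embeddings it replaces.
#    (A complete pipeline would treat the LM head rows the same way and then
#    continue training on Russian data.)
old_emb = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(len(new_tok))
new_emb = model.get_input_embeddings().weight.data
for new_id in range(len(new_tok)):
    piece = new_tok.convert_ids_to_tokens(new_id)
    if piece in new_tok.all_special_tokens:
        continue
    text = piece.replace("▁", " ")     # SentencePiece whitespace marker
    old_ids = [i for i in old_tok.convert_tokens_to_ids(old_tok.tokenize(text))
               if i is not None and i != old_tok.unk_token_id]
    if old_ids:
        new_emb[new_id] = old_emb[old_ids].mean(dim=0)

model.save_pretrained("llama-7b-ru-vocab")
new_tok.save_pretrained("llama-7b-ru-vocab")
```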
Related papers
- Code-Mixed Telugu-English Hate Speech Detection
This study investigates transformer-based models, including TeluguHateBERT, HateBERT, DeBERTa, MuRIL, IndicBERT, RoBERTa, and Hindi-Abusive-MuRIL, for classifying hate speech in Telugu.
We fine-tune these models using Low-Rank Adaptation (LoRA) to optimize efficiency and performance.
We translate Telugu text into English using Google Translate to assess its impact on classification accuracy.
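As a point of reference for the LoRA fine-tuning mentioned above, a minimal setup with the Hugging Face peft library could look like the sketch below. The MuRIL checkpoint name, the binary label count, and all hyperparameters are assumptions for illustration, not the configuration used in that study.

```python
# Minimal LoRA fine-tuning sketch for a Telugu hate-speech classifier.
# The backbone checkpoint, label count, and all hyperparameters are
# illustrative assumptions, not the configuration reported in the paper.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

model_name = "google/muril-base-cased"          # assumed MuRIL backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Inject low-rank adapters into the attention projections; only the adapter
# weights (and the classification head) are updated during fine-tuning.
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],          # BERT-style module names
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()              # typically well under 1% of all weights

args = TrainingArguments(
    output_dir="telugu-hate-lora",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-4,
)
# With tokenized HF Datasets (columns: input_ids, attention_mask, labels):
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```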
arXiv Detail & Related papers (2025-02-15T02:03:13Z)
- Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian
We study how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance.
We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens.
arXiv Detail & Related papers (2025-01-05T19:21:45Z)
- Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning
This study investigates the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi.
Using a translated dataset with 52,000 instruction-response pairs, our findings reveal that while automatic evaluation performance declines post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts.
arXiv Detail & Related papers (2024-11-27T18:14:38Z)
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR).
In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages.
We extend a conventional adapter-based efficient fine-tuning scheme to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
- Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z)
- Prompt-Tuning Can Be Much Better Than Fine-Tuning on Cross-lingual Understanding With Multilingual Language Models
In this paper, we perform cross-lingual evaluation on various NLU tasks using prompt-tuning and compare it with fine-tuning.
The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets.
arXiv Detail & Related papers (2022-10-22T05:48:02Z)
- Understanding by Understanding Not: Modeling Negation in Language Models
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
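As a rough sketch of how an unlikelihood term of this kind can be combined with the standard language-modeling loss (not necessarily the paper's exact formulation), the snippet below adds a -log(1 - p) penalty on positions flagged as negative; the negative_mask convention and the weight alpha are illustrative assumptions.

```python
# Sketch of an unlikelihood term added to the standard LM objective.
# The combination weight `alpha` and the notion of "negative" target positions
# are illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def lm_loss_with_unlikelihood(logits, targets, negative_mask, alpha=1.0):
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids;
    negative_mask: (batch, seq) bool, True where the target token should be
    made *less* likely (e.g. it comes from a negated generic sentence)."""
    log_probs = F.log_softmax(logits, dim=-1)
    tgt_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(y_t)

    # Standard likelihood term on ordinary positions: -log p(y_t)
    nll = -(tgt_logp * (~negative_mask)).sum() / (~negative_mask).sum().clamp(min=1)

    # Unlikelihood term on negative positions: -log(1 - p(y_t))
    one_minus_p = (1.0 - tgt_logp.exp()).clamp(min=1e-6)
    ul = -(one_minus_p.log() * negative_mask).sum() / negative_mask.sum().clamp(min=1)

    return nll + alpha * ul
```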
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- Pre-Training a Language Model Without Human Language
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we uncover that pre-training on certain non-human language data yields GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT) testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)