Impact of Tokenization on LLaMa Russian Adaptation
- URL: http://arxiv.org/abs/2312.02598v1
- Date: Tue, 5 Dec 2023 09:16:03 GMT
- Title: Impact of Tokenization on LLaMa Russian Adaptation
- Authors: Mikhail Tikhomirov and Daniil Chernyshev
- Abstract summary: We investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation.
The results of automatic evaluation show that vocabulary substitution improves the model's quality in Russian.
Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The latest instruction-tuned large language models (LLMs) show strong results on various tasks; however, they often suffer performance degradation on non-English input. There is evidence that the cause is inefficient tokenization stemming from low representation of the language in the pre-training data, which hinders comprehension of non-English instructions and limits the potential of target-language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and on fine-tuning for the Russian SuperGLUE benchmark. Automatic evaluation shows that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (by 35%) and inference (by up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.
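To make the vocabulary-substitution idea concrete, here is a minimal sketch using the Hugging Face transformers API: tokenize a Russian sentence with the original LLaMa tokenizer and with a Russian-adapted one, then resize the model's embeddings to match the new vocabulary. The adapted-tokenizer path is a placeholder, not an artifact released with the paper.

```python
# Sketch of vocabulary substitution for language adaptation (illustrative names).
from transformers import AutoModelForCausalLM, AutoTokenizer

orig_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
ru_tok = AutoTokenizer.from_pretrained("path/to/russian-adapted-tokenizer")  # hypothetical

text = "Каковы основные достопримечательности Москвы?"  # "What are Moscow's main sights?"
print("original:", len(orig_tok(text).input_ids))  # Russian splits into many byte-level pieces
print("adapted: ", len(ru_tok(text).input_ids))    # fewer tokens -> faster training and inference

# Substitute the vocabulary: resize the input/output embedding matrices to the
# new tokenizer's size; the fresh rows must then be initialized and trained.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
model.resize_token_embeddings(len(ru_tok))
```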
Related papers
- Code-Mixed Telugu-English Hate Speech Detection [0.0]
This study investigates transformer-based models, including TeluguHateBERT, HateBERT, DeBERTa, MuRIL, IndicBERT, RoBERTa, and Hindi-Abusive-MuRIL, for classifying hate speech in Telugu.
We fine-tune these models using Low-Rank Adaptation (LoRA) to optimize efficiency and performance.
We translate Telugu text into English using Google Translate to assess its impact on classification accuracy.
arXiv Detail & Related papers (2025-02-15T02:03:13Z)
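As an illustration of the LoRA setup described in the entry above, here is a minimal sketch with the peft library; the MuRIL checkpoint, label count, and hyperparameters are illustrative choices, not the study's exact configuration.

```python
# Minimal LoRA fine-tuning setup with peft (illustrative hyperparameters).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased", num_labels=2)  # hate / not-hate (assumed labels)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```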
- Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian [0.19116784879310028]
This study examines how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance.
We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens.
arXiv Detail & Related papers (2025-01-05T19:21:45Z)
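The pruning approach above can be sketched as a corpus-statistics pass: count which vocabulary entries ever fire on target-language text and mark the rest as removable. The corpus file is hypothetical, and actually rebuilding the tokenizer and embedding matrix from the kept ids is model-specific.

```python
# Sketch: find vocabulary entries unused on a target-language corpus.
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
usage = Counter()
with open("estonian_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus
    for line in f:
        usage.update(tok(line.strip(), add_special_tokens=False).input_ids)

unused = set(range(tok.vocab_size)) - set(usage)
print(f"{len(unused)} of {tok.vocab_size} tokens never occur in the corpus")
# sorted(set(usage)) would be the kept ids driving the actual pruning step
```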
- Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning [0.4194295877935868]
This study investigates the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi.
Using a translated dataset with 52,000 instruction-response pairs, our findings reveal that while automatic evaluation performance declines post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts.
arXiv Detail & Related papers (2024-11-27T18:14:38Z)
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR).
In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages.
We extend a conventional adapter-based efficient fine-tuning scheme to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
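For context, adapter-based fine-tuning inserts small bottleneck modules into a frozen backbone and trains only those. A minimal PyTorch sketch; the dimensions are illustrative, not the paper's architecture.

```python
# A bottleneck adapter: down-project, nonlinearity, up-project, residual add.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path

# Inserted after each frozen transformer block; only adapter weights train.
hidden = torch.randn(2, 10, 768)   # (batch, sequence, width)
print(Adapter(768)(hidden).shape)  # torch.Size([2, 10, 768])
```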
- Robustifying Language Models with Test-Time Adaptation [17.96043752001886]
Large-scale language models have achieved state-of-the-art performance on a number of language tasks.
They fail on adversarial language examples: sentences optimized to fool the language models while keeping a similar semantic meaning for humans.
We show that we can reverse many language adversarial attacks by adapting the input sentence with predictions from masked words.
arXiv Detail & Related papers (2023-10-29T22:37:54Z)
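The masked-word repair idea above can be sketched with a fill-mask pipeline: mask each position in turn and let the model vote on a plausible replacement, which can undo small adversarial edits. This simplifies the paper's procedure and uses an arbitrary checkpoint.

```python
# Sketch: repair a (possibly adversarial) sentence via masked-word predictions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def repair(tokens: list) -> list:
    repaired = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = fill.tokenizer.mask_token  # hide one word at a time
        best = fill(" ".join(masked), top_k=1)[0]["token_str"]
        repaired.append(best)
    return repaired

print(repair("the film was truly grate".split()))  # 'grate' -> likely 'great'
```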
- Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on the downstream performance of Large Language Models (LLMs) by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than an English-only tokenizer's.
arXiv Detail & Related papers (2023-10-12T22:44:19Z)
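Comparisons like the one above typically rest on tokenizer "fertility", the average number of tokens per word: the higher it is for a language, the more compute and context the model spends on that language. A toy sketch with illustrative sentences rather than the paper's corpora.

```python
# Sketch: measure tokens-per-word ("fertility") for one tokenizer across languages.
from transformers import AutoTokenizer

def fertility(tok, text: str) -> float:
    return len(tok(text, add_special_tokens=False).input_ids) / len(text.split())

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
samples = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "ru": "Сегодня хорошая погода.",
}
for lang, text in samples.items():
    print(lang, round(fertility(tok, text), 2))  # higher = less efficient encoding
```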
- Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast [50.19681990847589]
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes, and identifies which network weights affect overall performance the most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z)
- Prompt-Tuning Can Be Much Better Than Fine-Tuning on Cross-lingual Understanding With Multilingual Language Models [95.32691891392903]
In this paper, we conduct cross-lingual evaluation on various NLU tasks using prompt-tuning and compare it with fine-tuning.
The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets.
arXiv Detail & Related papers (2022-10-22T05:48:02Z)
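Prompt-tuning trains only a short sequence of soft prompt embeddings prepended to the input while the backbone stays frozen. A minimal sketch with peft; the checkpoint and prompt length are illustrative.

```python
# Sketch: prompt-tuning with peft; only the virtual prompt tokens are trained.
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # length of the learned soft prompt
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full parameter count
```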
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
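The removal step itself is directly supported in transformers via prune_heads. A sketch with arbitrary head indices, not the Shapley-selected, language-specific heads the paper identifies.

```python
# Sketch: physically remove chosen attention heads from a BERT-style model.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)
# Mapping of {layer_index: [head_indices_to_remove]} -- arbitrary choices here.
model.prune_heads({0: [2, 5], 3: [7]})
print(model.config.pruned_heads)  # the model records what was removed
```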
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
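An unlikelihood objective penalizes probability mass on continuations the model should avoid, optimizing -log(1 - p) where the usual objective maximizes log p. A self-contained sketch; shapes and token ids are toy values, not the paper's setup.

```python
# Sketch: unlikelihood loss that pushes probability away from "negative" tokens.
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, negative_targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, vocab); negative_targets: (batch,) ids the model should avoid
    probs = F.softmax(logits, dim=-1)
    p_neg = probs.gather(1, negative_targets.unsqueeze(1)).squeeze(1)
    return -torch.log1p(-p_neg.clamp(max=1 - 1e-6)).mean()  # mean of -log(1 - p)

logits = torch.randn(4, 32000)          # toy batch of next-token logits
bad_ids = torch.tensor([5, 17, 99, 3])  # toy ids of disallowed continuations
print(unlikelihood_loss(logits, bad_ids))
```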
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human-language data yields GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on.
We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge through sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z)
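A masked-prediction probe of the kind mentioned above can be run directly with a fill-mask pipeline: mask the verb of a relative-clause sentence and inspect the model's top choices. The example sentence is an illustration, not an item from the paper's materials.

```python
# Sketch: probe a masked LM's grammatical knowledge at a masked position.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
sentence = "The author that the critics praised <mask> a new book."
for cand in fill(sentence, top_k=3):
    print(cand["token_str"], round(cand["score"], 3))  # e.g. plausible verbs
```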
- Pronoun-Targeted Fine-tuning for NMT with Hybrid Losses [6.596002578395152]
We introduce a class of conditional generative-discriminative hybrid losses that we use to fine-tune a trained machine translation model.
We improve the performance of both a sentence-level and a contextual model without using any additional data.
Our sentence-level model shows a 0.5 BLEU improvement on both the WMT14 and the IWSLT13 De-En test sets. Our contextual model achieves the best results, improving from 31.81 to 32 BLEU on the WMT14 De-En test set, and from 32.10 to 33.13 on the IWSLT13 De-En test set.
arXiv Detail & Related papers (2020-10-15T10:11:40Z)
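A hybrid loss in this generative-discriminative sense is commonly built as a weighted sum of the token-level cross-entropy used for generation and a sentence-level discriminative term. A generic sketch of that combination, not the paper's exact formulation.

```python
# Sketch: combine a generative (token-level) and a discriminative (sentence-level) loss.
import torch
import torch.nn.functional as F

def hybrid_loss(token_logits, token_targets, sent_logits, sent_labels, lam=0.5):
    gen = F.cross_entropy(token_logits.transpose(1, 2), token_targets)  # per-token CE
    disc = F.cross_entropy(sent_logits, sent_labels)                    # sentence-level CE
    return gen + lam * disc  # lam balances the two objectives

tok_logits = torch.randn(2, 7, 32000)          # (batch, seq, vocab)
tok_targets = torch.randint(0, 32000, (2, 7))  # reference translation ids
sent_logits = torch.randn(2, 2)                # toy sentence-level scores
sent_labels = torch.randint(0, 2, (2,))
print(hybrid_loss(tok_logits, tok_targets, sent_logits, sent_labels))
```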
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.