Vocabulary Expansion of Chat Models with Unlabeled Target Language Data
- URL: http://arxiv.org/abs/2412.11704v2
- Date: Wed, 18 Dec 2024 12:29:11 GMT
- Title: Vocabulary Expansion of Chat Models with Unlabeled Target Language Data
- Authors: Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras,
- Abstract summary: Chat models (i.e. language models trained to follow instructions through conversation with humans) outperform base models (i.e. trained solely on unlabeled data) in both conversation and general task-solving abilities.
These models are generally English-centric and require further adaptation for languages that are underrepresented in or absent from their training data.
We propose post-hoc techniques that inject information from the source model without requiring any further training. Experiments reveal the effectiveness of our methods, helping the adapted models to achieve performance improvements in 87% of cases.
- Score: 38.341705137026985
- License:
- Abstract: Chat models (i.e. language models trained to follow instructions through conversation with humans) outperform base models (i.e. trained solely on unlabeled data) in both conversation and general task-solving abilities. These models are generally English-centric and require further adaptation for languages that are underrepresented in or absent from their training data. A common technique for adapting base models is to extend the model's vocabulary with target language tokens, i.e. vocabulary expansion (VE), and then continually pre-train it on language-specific data. Using chat data is ideal for chat model adaptation, but often, either this does not exist or is costly to construct. Alternatively, adapting chat models with unlabeled data is a possible solution, but it could result in catastrophic forgetting. In this paper, we investigate the impact of using unlabeled target language data for VE on chat models for the first time. We first show that off-the-shelf VE generally performs well across target language tasks and models in 71% of cases, though it underperforms in scenarios where source chat models are already strong. To further improve adapted models, we propose post-hoc techniques that inject information from the source model without requiring any further training. Experiments reveal the effectiveness of our methods, helping the adapted models to achieve performance improvements in 87% of cases.
Related papers
- Adapting Large Language Models for Character-based Augmentative and Alternative Communication [8.072353085704629]
Augmentative and Alternative Communication (AAC) users may write letter-by-letter via an interface that uses a character language model.
We investigate how to practically use such models to make accurate and efficient character predictions.
arXiv Detail & Related papers (2025-01-17T22:20:55Z) - Robustifying Language Models with Test-Time Adaptation [17.96043752001886]
Large-scale language models achieved state-of-the-art performance over a number of language tasks.
They fail on adversarial language examples, which are sentences optimized to fool the language models but with similar semantic meanings for humans.
We show that we can reverse many language adversarial attacks by adapting the input sentence with predictions from masked words.
arXiv Detail & Related papers (2023-10-29T22:37:54Z) - Qwen Technical Report [132.54304067403922]
We introduce Qwen, the first installment of our large language model series.
Qwen is the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques.
We have also developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat.
arXiv Detail & Related papers (2023-09-28T17:07:49Z) - Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates ($$7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z) - Efficient Language Model Training through Cross-Lingual and Progressive
Transfer Learning [0.7612676127275795]
Most Transformer language models are pretrained on English text.
As model sizes grow, the performance gap between English and other languages increases even further.
We introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer.
arXiv Detail & Related papers (2023-01-23T18:56:12Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to performance pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary$-$typically selected before training and permanently fixed later$-$affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.