Dynamic data sampler for cross-language transfer learning in large language models
- URL: http://arxiv.org/abs/2405.10626v1
- Date: Fri, 17 May 2024 08:40:51 GMT
- Title: Dynamic data sampler for cross-language transfer learning in large language models
- Authors: Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou,
- Abstract summary: ChatFlow is a cross-language transfer-based Large Language Models (LLMs)
We employ a mix of Chinese, English, and parallel corpus to continuously train the LLaMA2 model.
Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance.
- Score: 34.464472766868106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty in acquiring large-scale corpus and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpus to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate the knowledge transfer specifically to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks, the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - Self-Translate-Train: Enhancing Cross-Lingual Transfer of Large Language Models via Inherent Capability [31.025371443719404]
Self-Translate-Train is a method that lets large language models translate training data into the target language and fine-tunes the model on its own generated data.
By demonstrating that Self-Translate-Train outperforms zero-shot transfer, we encourage further exploration of better methods to elicit cross-lingual capabilities of LLMs.
arXiv Detail & Related papers (2024-06-29T14:40:23Z) - Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language [7.289015788793582]
This work focuses on increasing technological participation for the S'ami language.
We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages.
We have compiled the available S'ami language resources from the web to create a clean dataset for training language models.
arXiv Detail & Related papers (2024-05-09T13:54:22Z) - Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the textttLlama2-7b-chat model on nine different languages from the MUST-C dataset.
The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z) - Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca [23.00353889531171]
We propose a method to augment LLaMA with capabilities for understanding and generating Chinese text.
We incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets.
Results on the C-Eval dataset yield competitive performance among the models with several times the size of ours.
arXiv Detail & Related papers (2023-04-17T11:39:53Z) - Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z) - Investigating Transfer Learning in Multilingual Pre-trained Language
Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI)
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.