Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- URL: http://arxiv.org/abs/2304.08177v3
- Date: Fri, 23 Feb 2024 02:22:36 GMT
- Title: Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- Authors: Yiming Cui, Ziqing Yang, Xin Yao
- Abstract summary: We propose a method to augment LLaMA with capabilities for understanding and generating Chinese text.
We incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets.
- Abstract summary: On the C-Eval dataset, the model achieves performance competitive with models several times its size.
- Score: 23.00353889531171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically
transformed natural language processing research and shown promising strides
towards Artificial General Intelligence (AGI). Nonetheless, the high costs
associated with training and deploying LLMs present substantial obstacles to
transparent, accessible academic research. While several large language models,
such as LLaMA, have been open-sourced by the community, these predominantly
focus on English corpora, limiting their usefulness for other languages. In
this paper, we propose a method to augment LLaMA with capabilities for
understanding and generating Chinese text and for following instructions. We
achieve this by extending LLaMA's existing vocabulary with an
additional 20,000 Chinese tokens, thereby improving its encoding efficiency and
semantic understanding of Chinese. We further incorporate secondary
pre-training using Chinese data and fine-tune the model with Chinese
instruction datasets, significantly enhancing the model's ability to comprehend
and execute instructions. Our experimental results indicate that the newly
proposed model markedly enhances the original LLaMA's proficiency in
understanding and generating Chinese content. Additionally, results on the
C-Eval dataset show that our model is competitive with models several times
its size. We have made our pre-trained models, training scripts,
and other resources available through GitHub, fostering open research for our
community. Chinese LLaMA series:
\url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series:
\url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}
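The vocabulary extension described in the abstract, roughly 20,000 additional Chinese tokens plus a matching resize of the embedding matrix, can be sketched as follows. This is a minimal illustration assuming the Hugging Face `transformers` API; the model path and the tiny token list are placeholders, and the released scripts merge a separately trained Chinese SentencePiece model rather than calling `add_tokens` directly.

```python
# Minimal sketch of the vocabulary-extension idea, assuming the Hugging Face
# `transformers` API. The model path and the tiny token list are placeholders;
# the released project merges a full Chinese SentencePiece model instead.
from transformers import LlamaTokenizer, LlamaForCausalLM

BASE_MODEL = "path/to/original-llama"            # hypothetical local checkpoint
CHINESE_TOKENS = ["中国", "自然语言", "处理"]      # stand-in for ~20,000 pieces

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(BASE_MODEL)

text = "大模型正在改变自然语言处理研究。"
before = len(tokenizer.tokenize(text))           # byte-level fallback: many tokens

# Register the new Chinese tokens and grow the embedding matrix to match,
# so the new rows can be learned during secondary pre-training.
tokenizer.add_tokens(CHINESE_TOKENS)
model.resize_token_embeddings(len(tokenizer))

after = len(tokenizer.tokenize(text))
print(f"tokens per sentence: {before} -> {after}")
```

Tokenizing the same Chinese sentence before and after the extension is a quick way to check the claimed encoding-efficiency gain: fewer tokens per sentence means lower cost per character and a longer effective context for Chinese text.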
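The secondary pre-training is followed by fine-tuning on Chinese instruction data. Below is a hypothetical sketch of how instruction/response pairs might be rendered into training prompts, Alpaca-style; the Chinese template and the example are illustrative, not the authors' released format.

```python
# Hypothetical Alpaca-style prompt formatting for Chinese instruction tuning.
# The template and example are illustrative placeholders.
PROMPT_TEMPLATE = (
    "下面是一个描述任务的指令。请写出恰当完成该请求的回复。\n\n"
    "### 指令:\n{instruction}\n\n### 回复:\n"
)

def build_example(instruction: str, response: str) -> dict:
    """Turn one (instruction, response) pair into a prompt/completion record."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    return {"prompt": prompt, "completion": response}

example = build_example(
    instruction="用一句话解释什么是大语言模型。",
    response="大语言模型是通过海量文本训练、能够理解和生成自然语言的神经网络模型。",
)
print(example["prompt"] + example["completion"])
```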
Related papers
- YuLan: An Open-source Large Language Model [179.59272970659677]
This paper presents the development of YuLan, a series of open-source large language models (LLMs) with 12 billion parameters.
The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts.
We devise a curriculum-learning framework across these training stages, which helps LLMs learn knowledge in an easy-to-hard manner.
arXiv Detail & Related papers (2024-06-28T11:52:53Z)
- Dynamic data sampler for cross-language transfer learning in large language models [34.464472766868106]
ChatFlow is a cross-language transfer-based large language model (LLM).
We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model.
Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance (a toy sketch of such mixed-corpus sampling follows this entry).
arXiv Detail & Related papers (2024-05-17T08:40:51Z)
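Below is a toy sketch of the kind of mixed Chinese/English/parallel sampling the ChatFlow summary describes; the corpora, weights, and schedule are invented for illustration and are not the paper's actual dynamic sampler.

```python
# Illustrative mixed-corpus sampler with a weight schedule that shifts toward
# Chinese as training progresses. All data, weights, and the schedule are
# hypothetical placeholders, not the paper's method.
import random

corpora = {
    "zh": ["中文句子一。", "中文句子二。"],
    "en": ["An English sentence.", "Another English sentence."],
    "parallel": ["中文句子 ||| Its English translation."],
}

def sample_batch(step, total_steps, batch_size=4, seed=None):
    """Draw a batch, gradually moving probability mass from English and
    parallel data toward Chinese (one possible 'dynamic' schedule)."""
    rng = random.Random(seed)
    progress = step / total_steps
    weights = {"zh": 0.3 + 0.5 * progress,
               "en": 0.5 - 0.4 * progress,
               "parallel": 0.2 - 0.1 * progress}
    names = list(corpora)
    return [rng.choice(corpora[rng.choices(names,
                                           weights=[weights[n] for n in names])[0]])
            for _ in range(batch_size)]

print(sample_batch(step=0, total_steps=1000, seed=0))    # mostly English early on
print(sample_batch(step=900, total_steps=1000, seed=0))  # mostly Chinese later
```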
- Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model [36.01840141194335]
We introduce CT-LLM, a 2B-parameter large language model (LLM).
Initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data.
CT-LLM excels at Chinese language tasks and, through SFT, also demonstrates its adeptness in English.
arXiv Detail & Related papers (2024-04-05T15:20:02Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- An Empirical Study of Instruction-tuning Large Language Models in Chinese [32.5288378307064]
This paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook.
Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types.
We also conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment.
arXiv Detail & Related papers (2023-10-11T09:18:09Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language-branch models into a single model for all target languages (a toy distillation-loss sketch follows this entry).
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
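A toy sketch of distilling several language-branch teachers into one student model, the general idea behind the multilingual distillation step above; the shapes, temperature, and simple averaging are assumptions, not the paper's exact formulation.

```python
# Toy multi-teacher distillation loss: average the softened KL divergence
# between each language-branch teacher and the single student. Shapes,
# temperature, and models are hypothetical placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Average KL(teacher || student) over all language-branch teachers,
    computed on temperature-softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    losses = []
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        losses.append(F.kl_div(log_p_student, p_teacher, reduction="batchmean"))
    return (temperature ** 2) * torch.stack(losses).mean()

# Random logits standing in for answer-span scores from three language branches.
student = torch.randn(8, 128)
teachers = [torch.randn(8, 128) for _ in range(3)]
print(distillation_loss(student, teachers).item())
```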
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.