LLaMA Beyond English: An Empirical Study on Language Capability Transfer
- URL: http://arxiv.org/abs/2401.01055v2
- Date: Fri, 12 Jan 2024 08:14:12 GMT
- Title: LLaMA Beyond English: An Empirical Study on Language Capability Transfer
- Authors: Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang
- Abstract summary: We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
- Score: 49.298360366468934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, substantial advancements have been witnessed in large
language models (LLMs), exemplified by ChatGPT, showcasing remarkable
proficiency across a range of complex tasks. However, many mainstream LLMs
(e.g., LLaMA) are pretrained on English-dominant corpora, which limits their
performance in non-English languages. In this paper, we focus on how to
effectively transfer the capabilities of language generation and instruction
following to a non-English language. To answer this question, we conduct an
extensive empirical investigation based on LLaMA, accumulating over 1440 GPU
hours. We analyze the impact of key factors such as vocabulary extension,
further pretraining, and instruction tuning on transfer. To accurately assess
the model's level of knowledge, we employ four widely used standardized testing
benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a
comprehensive evaluation of the model's response quality is conducted,
considering aspects such as accuracy, fluency, informativeness, logical
coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of
instruction tasks from 17 diverse categories. Our evaluation results
demonstrate that comparable performance to state-of-the-art transfer models can
be achieved with less than 1% of the pretraining data, both in terms of
knowledge alignment and response quality. Furthermore, the experimental
outcomes across the thirteen low-resource languages also exhibit similar
trends. We anticipate that the conclusions revealed by the experiments will aid
the community in developing non-English LLMs.
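To make the transfer recipe concrete, below is a minimal sketch of the vocabulary-extension step using the Hugging Face transformers API; the checkpoint name and added tokens are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of vocabulary extension for language capability transfer.
# The checkpoint name and new tokens below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumed LLaMA-family checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical target-language tokens, e.g. frequent Chinese subwords mined
# from a target-language corpus with a separately trained tokenizer.
new_tokens = ["你好", "世界", "模型"]
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the added tokens receive trainable rows;
# the new rows are randomly initialized and must be learned during further
# pretraining on target-language text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Further pretraining and instruction tuning would then run a standard
# causal-LM training loop (e.g. transformers.Trainer) over target-language
# corpora and instruction data, respectively.
```

This sketch covers only the vocabulary-extension factor; further pretraining and instruction tuning reuse standard causal-LM training and are omitted here.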
Related papers
- Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
Evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts.
We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Analyzing and Adapting Large Language Models for Few-Shot Multilingual
NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT), and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on six high- and low-resource languages, three different NLU tasks, and a variety of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z) - An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference [38.1823640848362]
State-of-the-art generative large language models (LLMs) disproportionately rely on English-centric tokenizers, vocabularies, and pre-training data.
Recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English (a minimal illustration appears after this list).
arXiv Detail & Related papers (2024-02-16T14:15:15Z) - Are Large Language Models Good Fact Checkers: A Preliminary Study [26.023148371263012]
Large Language Models (LLMs) have drawn significant attention due to their outstanding reasoning capabilities and extensive knowledge repository.
This study aims to comprehensively evaluate various LLMs in tackling specific fact-checking subtasks.
arXiv Detail & Related papers (2023-11-29T05:04:52Z) - CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z) - Towards a Common Understanding of Contributing Factors for Cross-Lingual
Transfer in Multilingual Language Models: A Review [2.578242050187029]
Pre-trained Multilingual Language Models (MLLMs) have shown a strong ability to transfer knowledge across different languages.
It is challenging to obtain a unique and straightforward explanation for the emergence of this ability.
This review provides, first, an aligned reference point for future research and, second, guidance for a better-informed and more efficient way of leveraging the cross-lingual capacity of MLLMs.
arXiv Detail & Related papers (2023-05-26T09:31:12Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total scores of LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
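As a rough illustration of the inference-efficiency point made in the cross-lingual vocabulary adaptation entry above, the following sketch counts how many tokens an English-centric tokenizer spends on comparable English and non-English strings; the model name and sample sentences are assumptions for demonstration only.

```python
# Rough illustration of tokenizer "fertility": an English-centric tokenizer
# typically needs more tokens per character for non-English text, which
# slows autoregressive generation. Model name and samples are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

samples = {
    "en": "The weather is very nice today.",
    "zh": "今天天气非常好。",  # roughly the same meaning in Chinese
}

for lang, text in samples.items():
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    print(f"{lang}: {len(ids)} tokens for {len(text)} characters "
          f"({len(ids) / len(text):.2f} tokens/char)")
```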
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.