UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset
- URL: http://arxiv.org/abs/2402.04588v2
- Date: Sun, 18 Feb 2024 03:56:45 GMT
- Title: UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset
- Authors: Haoyu Wang, Shuo Wang, Yukun Yan, Xujia Wang, Zhiyu Yang, Yuzhuang Xu,
Zhenghao Liu, Liner Yang, Ning Ding, Xu Han, Zhiyuan Liu, Maosong Sun
- Abstract summary: Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
- Score: 69.33424532827608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-source large language models (LLMs) have gained significant strength
across diverse fields. Nevertheless, the majority of studies primarily
concentrate on English, with only limited exploration into the realm of
multilingual abilities. In this work, we therefore construct an open-source
multilingual supervised fine-tuning dataset. Different from previous works that
simply translate English instructions, we consider both the language-specific
and language-agnostic abilities of LLMs. Firstly, we introduce a
knowledge-grounded data augmentation approach to elicit more language-specific
knowledge of LLMs, improving their ability to serve users from different
countries. Moreover, we find modern LLMs possess strong cross-lingual transfer
capabilities, thus repeatedly learning identical content in various languages
is not necessary. Consequently, we can substantially prune the
language-agnostic supervised fine-tuning (SFT) data without any performance
degradation, making multilingual SFT more efficient. The resulting UltraLink
dataset comprises approximately 1 million samples across five languages (i.e.,
En, Zh, Ru, Fr, Es), and the proposed data construction method can be easily
extended to other languages. UltraLink-LM, which is trained on UltraLink,
outperforms several representative baselines across many tasks.
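The abstract points to two construction steps: generating language-specific samples grounded in local documents, and pruning the translated, language-agnostic SFT data because cross-lingual transfer makes repeating identical content in every language unnecessary. The Python sketch below illustrates both ideas at a high level; the prompts, the llm_generate stub, and the 0.3 keep ratio are illustrative assumptions, not the released UltraLink pipeline.

```python
import random
from dataclasses import dataclass


@dataclass
class SFTExample:
    language: str
    instruction: str
    response: str


def llm_generate(prompt: str) -> str:
    """Placeholder for any chat-LLM call (an API or a local model)."""
    return "<model output>"


# 1) Knowledge-grounded augmentation: condition generation on a language-specific
#    document so the resulting sample carries local, language-specific knowledge.
def knowledge_grounded_example(document: str, language: str) -> SFTExample:
    question = llm_generate(
        f"Read the following {language} text and write one question a local user "
        f"might plausibly ask about it:\n{document}"
    )
    answer = llm_generate(
        f"Answer the question using only the text below.\n"
        f"Text:\n{document}\nQuestion:\n{question}"
    )
    return SFTExample(language=language, instruction=question, response=answer)


# 2) Pruning language-agnostic SFT data: keep all English samples but only a
#    subset of their translated counterparts, relying on cross-lingual transfer.
def prune_language_agnostic(examples, keep_ratio=0.3, seed=0):
    rng = random.Random(seed)
    english = [e for e in examples if e.language == "En"]
    others = [e for e in examples if e.language != "En"]
    kept = rng.sample(others, k=int(len(others) * keep_ratio)) if others else []
    return english + kept


# Toy usage; the document, languages, and ratio are placeholders.
doc_fr = "Le TGV relie Paris à Lyon en environ deux heures."
samples = [
    knowledge_grounded_example(doc_fr, "Fr"),
    SFTExample("En", "What is SFT?", "Supervised fine-tuning."),
    SFTExample("Zh", "什么是SFT?", "监督微调。"),
]
pruned = prune_language_agnostic(samples)
```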
Related papers
- Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets [38.867815476721894]
Most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages.
Traditional methods for creating multilingual IFT datasets struggle to capture linguistic nuances and ensure prompt (instruction) diversity.
We propose a novel method for collecting multilingual IFT datasets that preserves linguistic naturalness and ensures prompt diversity.
arXiv Detail & Related papers (2024-07-01T23:47:09Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on question translation data (i.e., translated questions without annotated answers) strengthen the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation [25.850573463743352]
Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks.
Yet significant performance disparities exist across different languages within the same mPLM.
We introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM.
arXiv Detail & Related papers (2024-04-12T14:19:16Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z)
- Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models.
We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting (see the sketch after this list).
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
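As referenced in the last entry above, translate-train bootstrapping can be approximated by few-shot prompting an LLM to translate English task data into a target language. The sketch below is a minimal illustration under that assumption; the prompt template, the llm_translate stub, and the seed pairs are hypothetical, not that paper's exact setup.

```python
def llm_translate(prompt: str) -> str:
    """Placeholder for any instruction-following LLM call."""
    return "<translated utterance>"


def build_few_shot_prompt(pairs, source, target_language):
    """Compose a few-shot translation prompt from (English, target) example pairs."""
    shots = "\n".join(f"English: {en}\n{target_language}: {tr}" for en, tr in pairs)
    return (f"Translate the English sentence into {target_language}.\n"
            f"{shots}\nEnglish: {source}\n{target_language}:")


def translate_dataset(english_examples, pairs, target_language):
    """Translate an English task dataset into the target language, one example at a time."""
    return [llm_translate(build_few_shot_prompt(pairs, ex, target_language))
            for ex in english_examples]


# Example: bootstrap a French split from a handful of hypothetical seed translations.
seeds = [("book a flight to Paris", "réserver un vol pour Paris")]
french_split = translate_dataset(["cancel my reservation"], seeds, "French")
```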
This list is automatically generated from the titles and abstracts of the papers on this site.