Chinese Open Instruction Generalist: A Preliminary Release
- URL: http://arxiv.org/abs/2304.07987v4
- Date: Tue, 25 Apr 2023 01:50:19 GMT
- Title: Chinese Open Instruction Generalist: A Preliminary Release
- Authors: Ge Zhang, Yemin Shi, Ruibo Liu, Ruibin Yuan, Yizhi Li, Siwei Dong, Yu
Shu, Zhaoqun Li, Zekun Wang, Chenghua Lin, Wenhao Huang, Jie Fu
- Abstract summary: We propose this project as an attempt to create a Chinese instruction dataset using various methods adapted to the intrinsic characteristics of four sub-tasks.
We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality.
We summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora.
- Score: 33.81265396916227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning is widely recognized as a key technique for building
generalist language models, which has attracted the attention of researchers
and the public with the release of InstructGPT~\citep{ouyang2022training} and
ChatGPT\footnote{\url{https://chat.openai.com/}}. Despite impressive progress
in English-oriented large-scale language models (LLMs), it remains under-explored whether English-based foundation LLMs can, with well-designed instruction tuning, perform comparably on multilingual tasks, and how to construct the corpora needed for such tuning. To remedy this gap, we propose this project as an attempt to create a Chinese instruction dataset using various methods adapted to the intrinsic characteristics of four sub-tasks. We collect around 200k Chinese instruction tuning samples, which
have been manually checked to guarantee high quality. We also summarize the
existing English and Chinese instruction corpora and briefly describe some
potential applications of the newly constructed Chinese instruction corpora.
The resulting \textbf{C}hinese \textbf{O}pen \textbf{I}nstruction
\textbf{G}eneralist (\textbf{COIG}) corpora are available in
Huggingface\footnote{\url{https://huggingface.co/datasets/BAAI/COIG}} and
Github\footnote{\url{https://github.com/BAAI-Zlab/COIG}}, and will be
continuously updated.
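As a usage note (an editorial sketch, not from the paper): the released corpus can presumably be loaded with the Hugging Face `datasets` library. The file name `translated_instructions.jsonl` below is an assumption, since the repository hosts several sub-task files; check the dataset page for the actual names.

from datasets import load_dataset

# Editorial sketch (untested): load one COIG sub-task file from the Hugging Face Hub.
# "translated_instructions.jsonl" is an assumed file name; the repository ships
# several JSON/JSONL files, one per sub-task, so adjust data_files accordingly.
coig = load_dataset(
    "BAAI/COIG",
    data_files="translated_instructions.jsonl",
    split="train",
)

# Each record is expected to pair a Chinese instruction with its response.
print(coig[0])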
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script from the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions [43.90353059292894]
Large language models respond well in high-resource languages like English but struggle in low-resource languages.
We propose a novel method to construct cross-lingual instruction-following samples with instructions in English and responses in low-resource languages.
arXiv Detail & Related papers (2024-05-30T06:45:23Z)
- TIM: Teaching Large Language Models to Translate with Comparison [78.66926087162672]
We propose a novel framework that uses comparison examples to teach LLMs to translate.
Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning.
Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations.
arXiv Detail & Related papers (2023-07-10T08:15:40Z)
- ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback [90.20262941911027]
ParroT is a framework to enhance and regulate the translation abilities of LLMs during chat.
Specifically, ParroT reformulates translation data into the instruction-following style.
We propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction.
arXiv Detail & Related papers (2023-04-05T13:12:00Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce the Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
- CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.