From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models
- URL: http://arxiv.org/abs/2309.03412v2
- Date: Sun, 5 Nov 2023 06:32:30 GMT
- Title: From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models
- Authors: Masahiro Suzuki, Masanori Hirano, Hiroki Sakaji
- Abstract summary: We construct a Japanese instruction dataset by expanding and filtering existing datasets.
We perform Low-Rank Adaptation (LoRA) tuning on existing Japanese and English models.
- Score: 6.520584613661788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning is essential for large language models (LLMs) to become
interactive. While many instruction tuning datasets exist in English, other languages show
a noticeable lack of them, and their effectiveness has not been well verified in
non-English settings. We construct a Japanese instruction dataset by expanding and
filtering existing datasets and apply it to a Japanese pre-trained base model. Using this
dataset, we perform Low-Rank Adaptation (LoRA) tuning on existing Japanese and English
models and evaluate the tuned models from both quantitative and qualitative perspectives.
The results confirm the effectiveness of Japanese instruction datasets and indicate that,
even with relatively small LLMs, downstream task performance can be improved through
instruction tuning. Our instruction dataset, tuned models, and implementation are publicly
available online.
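As a rough illustration of the tuning recipe described above, the following is a minimal LoRA sketch using the Hugging Face transformers and peft libraries. The base model ("gpt2") and the toy instruction-response pair are placeholders, not the authors' actual setup; in practice a Japanese pre-trained base model and the full instruction dataset would be used.

```python
# Minimal LoRA instruction-tuning sketch (Hugging Face transformers + peft).
# "gpt2" and the toy example below are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA freezes the base weights and trains small low-rank adapter matrices
# injected into the attention projections, so only a tiny fraction of the
# parameters is updated.
config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection; varies by model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# One toy instruction-response pair serialized into a single training string.
example = "### Instruction:\nWhat is the height of Mt. Fuji?\n### Response:\n3,776 m."
batch = tokenizer(example, return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()  # causal LM objective

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**batch).loss  # gradients flow only into the LoRA adapters
loss.backward()
optimizer.step()
```

After training, only the small adapter weights need to be stored and shared; they can later be merged back into the base model for inference.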
Related papers
- MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z)
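As a loose illustration of the reverse-instruction idea in the MURI entry above, the sketch below builds a prompt that asks an LLM to invent the instruction for which an existing human-written text would be a good answer. The prompt wording and the generate_with_llm placeholder are hypothetical, not MURI's actual pipeline.

```python
# Hypothetical sketch of reverse instructions: treat an existing text as the
# *output* and ask an LLM to reconstruct a plausible *instruction* for it.
def build_reverse_instruction_prompt(document: str) -> str:
    return (
        "Below is a text written by a human.\n"
        "Write the instruction that this text would be a good answer to.\n\n"
        f"Text:\n{document}\n\nInstruction:"
    )

document = "Rice terraces are built on slopes to retain water for cultivation."
prompt = build_reverse_instruction_prompt(document)
# instruction = generate_with_llm(prompt)  # placeholder for an actual LLM call
# pair = {"instruction": instruction, "output": document}
print(prompt)
```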
- Towards Better Monolingual Japanese Retrievers with Multi-Vector Models [0.0]
In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders.
We introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts.
arXiv Detail & Related papers (2023-12-26T18:07:05Z)
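Multi-vector retrievers of the kind named in the entry above typically score a document by late interaction: each query token embedding is matched against its most similar document token embedding and the maxima are summed. Below is a minimal PyTorch sketch of that MaxSim operator, illustrative only and not JaColBERT's actual code.

```python
# Illustrative MaxSim late-interaction scoring used by ColBERT-style
# multi-vector retrievers (not JaColBERT's actual implementation).
import torch
import torch.nn.functional as F

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """query_embs: (q_len, dim); doc_embs: (d_len, dim)."""
    q = F.normalize(query_embs, dim=-1)  # cosine similarity via dot product
    d = F.normalize(doc_embs, dim=-1)
    sim = q @ d.T                        # (q_len, d_len) token-level similarities
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed

score = maxsim_score(torch.randn(5, 128), torch.randn(40, 128))
print(float(score))
```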
- Multi-dimensional data refining strategy for effective fine-tuning LLMs [2.67766280323297]
This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models.
The strategy is multidimensional, combining existing English-language datasets with customized data-crawling scripts developed with the assistance of generative AI tools.
arXiv Detail & Related papers (2023-11-02T07:50:43Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning improves performance on both tasks and in both languages, with the potential to improve mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
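A common way to realize the NLI-with-contrastive-loss recipe described in the entry above is the sentence-transformers library, where entailment pairs serve as positives and other in-batch examples as negatives. A minimal sketch follows; the model name and data are placeholders, not the paper's exact setup.

```python
# Sketch of contrastive fine-tuning on NLI entailment pairs with
# sentence-transformers; model and data are illustrative placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Each example pairs a premise with an entailed hypothesis; the other pairs
# in the batch act as negatives under the contrastive objective.
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "Someone is making music."]),
    InputExample(texts=["Two dogs run across a field.", "Animals are outdoors."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```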
- Improving Polish to English Neural Machine Translation with Transfer Learning: Effects of Data Volume and Language Similarity [2.4674086273775035]
We investigate the impact of data volume and the use of similar languages on transfer learning in a machine translation task.
We fine-tune an mBART model for a Polish-English translation task using the OPUS-100 dataset.
Our experiments show that combining related languages with larger amounts of data outperforms models trained on related languages or larger amounts of data alone.
arXiv Detail & Related papers (2023-06-01T13:34:21Z)
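For the transfer-learning setup in the entry above, a single fine-tuning step on mBART might look like the sketch below. The checkpoint name, language codes, and toy sentence pair are illustrative; the paper's exact configuration may differ.

```python
# Illustrative mBART fine-tuning step for Polish-to-English translation;
# the checkpoint and the toy sentence pair are placeholders.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="pl_PL", tgt_lang="en_XX"
)

# Tokenize one parallel sentence pair; labels come from the English target.
batch = tokenizer("Ala ma kota.", text_target="Alice has a cat.", return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**batch).loss  # standard cross-entropy over target tokens
loss.backward()
optimizer.step()
```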
- llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology [4.396516562723691]
This study constructed a Japanese chat dataset for tuning large language models (LLMs), which consists of about 8.4 million records.
The results suggest that our dataset is possibly beneficial for LLMs.
However, we also revealed some difficulties in constructing LLMs in languages other than English.
arXiv Detail & Related papers (2023-05-22T04:59:33Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)