llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large
Language Models and its Methodology
- URL: http://arxiv.org/abs/2305.12720v1
- Date: Mon, 22 May 2023 04:59:33 GMT
- Title: llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large
Language Models and its Methodology
- Authors: Masanori Hirano, Masahiro Suzuki, Hiroki Sakaji
- Abstract summary: This study constructed a Japanese chat dataset for tuning large language models (LLMs), which consists of about 8.4 million records.
The results suggest that our dataset is possibly beneficial for LLMs.
However, we also revealed some difficulties in constructing LLMs in languages other than English.
- Score: 4.396516562723691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study constructed a Japanese chat dataset for tuning large language
models (LLMs), consisting of about 8.4 million records. Recently, LLMs have
been developed and are gaining popularity. However, high-performing LLMs are
usually built mainly for English. There are two ways for such LLMs to support
languages other than English: constructing LLMs from scratch or tuning existing
models. In both approaches, datasets are a necessary component. In this study, we
focused on supporting Japanese in such LLMs and built a dataset for training
or tuning LLMs in Japanese. The dataset we constructed covers various
tasks, such as translation and knowledge tasks. In our experiment, we tuned an
existing LLM using our dataset and evaluated its performance qualitatively. The
results suggest that our dataset is possibly beneficial for LLMs. However, we
also revealed some difficulties in constructing LLMs in languages other than
English.
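Below is a minimal sketch of how such a dataset could be used to tune an existing LLM, under stated assumptions: the dataset is assumed to be published on the Hugging Face Hub in an Alpaca-style instruction/input/output format, and the repository id (DATASET_ID) and base model (BASE_MODEL) are illustrative placeholders rather than the authors' exact setup. The sketch tunes only low-rank (LoRA) adapters via the peft library, which is one common way to adapt an existing model, not necessarily the procedure used in the paper.

```python
# Minimal sketch (not the authors' pipeline): LoRA instruction-tuning of an
# existing causal LM on an Alpaca-style Japanese dataset.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "rinna/japanese-gpt-neox-3.6b"    # placeholder base model
DATASET_ID = "izumi-lab/llm-japanese-dataset"  # assumed Hub id for the dataset

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
# Wrap the base model with low-rank adapters so only a small fraction of
# parameters is updated during tuning.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

def format_example(example):
    # Concatenate instruction, optional input, and expected output into one
    # training string (assumed Alpaca-style fields).
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    text = prompt + "\n" + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = load_dataset(DATASET_ID, split="train")
tokenized = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=1e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After tuning, a qualitative evaluation of the kind described in the abstract would amount to generating responses from the adapted model on held-out Japanese prompts and inspecting them manually.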
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience [4.814313782484443]
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks.
We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z) - Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi [0.745652600521932]
We propose a large pre-training dataset for Hindi, an Indic language.
The dataset contains 1.28 billion Hindi tokens.
arXiv Detail & Related papers (2024-07-13T11:29:20Z) - Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) can encourage alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z) - Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - Introducing Bode: A Fine-Tuned Large Language Model for Portuguese
Prompt-Based Task [1.158680734110387]
This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
arXiv Detail & Related papers (2024-01-05T17:15:01Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages
with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT).
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)