Rapidly Developing High-quality Instruction Data and Evaluation
Benchmark for Large Language Models with Minimal Human Effort: A Case Study
on Japanese
- URL: http://arxiv.org/abs/2403.03690v1
- Date: Wed, 6 Mar 2024 13:17:07 GMT
- Title: Rapidly Developing High-quality Instruction Data and Evaluation
Benchmark for Large Language Models with Minimal Human Effort: A Case Study
on Japanese
- Authors: Yikun Sun, Zhen Wan, Nobuhiro Ueda, Sakiko Yahata, Fei Cheng, Chenhui
Chu, Sadao Kurohashi
- Abstract summary: We propose an efficient self-instruct method based on GPT-4.
We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality.
GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data.
- Score: 36.3163608701382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The creation of instruction data and evaluation benchmarks for serving large
language models often involves enormous human annotation. This issue becomes
particularly pronounced when rapidly developing such resources for a
non-English language like Japanese. Instead of following the popular practice
of directly translating existing English resources into Japanese (e.g.,
Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4.
We first translate a small amount of English instructions into Japanese and
post-edit them to obtain native-level quality. GPT-4 then utilizes them as
demonstrations to automatically generate Japanese instruction data. We also
construct an evaluation benchmark containing 80 questions across 8 categories,
using GPT-4 to automatically assess the response quality of LLMs without human
references. The empirical results suggest that the models fine-tuned on our
GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across
all three base pre-trained models. Our GPT-4 self-instruct data allowed the
LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37% win-rate. Human
evaluation confirms the consistency between GPT-4's assessments and human
preferences. Our high-quality instruction data and evaluation benchmark have
been publicly released.
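As a rough illustration of the generation step described in the abstract, the sketch below shows how a handful of post-edited Japanese seed instructions could be given to GPT-4 as demonstrations to elicit new Japanese instructions. The prompt wording, file name, and sampling parameters are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the GPT-4 self-instruct step: a few human post-edited
# Japanese seed instructions are shown as demonstrations, and GPT-4 is asked
# to produce new Japanese instructions in the same style.
import json
import random

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def load_seed_instructions(path: str) -> list[str]:
    """Load the small pool of translated-and-post-edited Japanese seeds."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["instruction"] for line in f]


def generate_batch(seeds: list[str], n_demos: int = 3, n_new: int = 5) -> list[str]:
    """Ask GPT-4 to write new Japanese instructions given a few demonstrations."""
    demos = random.sample(seeds, n_demos)
    demo_block = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(demos))
    prompt = (
        "以下は日本語の指示文の例です。\n"  # "Here are example Japanese instructions."
        f"{demo_block}\n\n"
        f"同じ形式で、新しい日本語の指示文を{n_new}件、1行に1件ずつ書いてください。"
        # "In the same format, write n_new new Japanese instructions, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    # Strip any leading numbering such as "1. " before returning.
    return [line.lstrip("0123456789. ").strip() for line in lines if line.strip()]


if __name__ == "__main__":
    seeds = load_seed_instructions("seed_instructions_ja.jsonl")  # hypothetical file
    print("\n".join(generate_batch(seeds)))
```

In practice such a loop would also deduplicate near-identical outputs and generate a response for each new instruction, but those steps are omitted here.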
Related papers
- Generalists vs. Specialists: Evaluating Large Language Models for Urdu [4.8539869147159616]
We compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models (XLM-Roberta-large, mT5-large, and Llama-3-8b) that have been fine-tuned on specific tasks.
We focus on seven classification and seven generation tasks to evaluate the performance of these models on the Urdu language.
arXiv Detail & Related papers (2024-07-05T12:09:40Z)
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation [93.55550787058012]
This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models.
To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts.
We then design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria.
arXiv Detail & Related papers (2024-01-08T18:52:09Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Efficient Finetuning Large Language Models For Vietnamese Chatbot [1.2075778142867704]
Large language models (LLMs) have been shown to achieve remarkable performance across a variety of natural language tasks.
We leverage large-scale instruction-following datasets from open-source projects, namely Alpaca, GPT4All, and Chat-Doctor.
We utilize parameter-efficient tuning through Low-Rank Adaptation (LoRA) on two open LLMs, resulting in four models: Bloomz-Chat, Bloomz-Doctor, GPTJ-Chat, and GPTJ-Doctor.
arXiv Detail & Related papers (2023-09-09T00:11:53Z)
- InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 [14.248735997950446]
We introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples.
Based on several data-quality metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data.
Our findings demonstrate that less but high-quality instruction tuning data efficiently enables multimodal large language models to generate better output.
arXiv Detail & Related papers (2023-08-23T11:27:30Z)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
- Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z)
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality [7.818228526742237]
GEMBA is a GPT-based metric for assessment of translation quality.
We investigate nine versions of GPT models, including ChatGPT and GPT-4.
Our method achieves state-of-the-art accuracy in both modes (with and without a human reference translation) when compared to MQM-based human labels.
arXiv Detail & Related papers (2023-02-28T12:23:48Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
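Several of the entries above (the GPT-4V(ision) text-to-3D evaluator, CritiqueLLM, and GEMBA), like the 80-question benchmark in the main paper, use an LLM as an automatic judge without human references. The sketch below shows one plausible way such reference-free pairwise judging and a win-rate like the reported 54.37% could be computed; the judging prompt, parsing rule, and helper names are illustrative assumptions and are not taken from any of these papers.

```python
# Hypothetical sketch of GPT-4-as-judge pairwise comparison and win-rate
# computation. Prompt text, model name, and data layout are assumptions.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, reply with "
    "only 'A' if Answer A is better, 'B' if Answer B is better, or 'TIE'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which of two answers is better, with no human reference."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    # Fall back to a tie if the judge returns anything unexpected.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"


def win_rate(questions, candidate_answers, baseline_answers) -> float:
    """Percentage of questions on which the candidate answer is judged better."""
    wins = sum(
        judge_pair(q, cand, base) == "A"
        for q, cand, base in zip(questions, candidate_answers, baseline_answers)
    )
    return 100.0 * wins / len(questions)
```

A real setup would typically query each pair twice with the answer order swapped and average the verdicts, to reduce the judge's position bias.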