Towards Better Instruction Following Language Models for Chinese:
Investigating the Impact of Training Data and Evaluation
- URL: http://arxiv.org/abs/2304.07854v1
- Date: Sun, 16 Apr 2023 18:37:39 GMT
- Title: Towards Better Instruction Following Language Models for Chinese:
Investigating the Impact of Training Data and Evaluation
- Authors: Yunjie Ji, Yan Gong, Yong Deng, Yiping Peng, Qiang Niu, Baochang Ma,
Xiangang Li
- Abstract summary: We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the open-source model whose performance is closest to that of proprietary language models like GPT-3.
- Score: 12.86275938443485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, significant public efforts have been directed towards developing
low-cost models with capabilities akin to ChatGPT, thereby fostering the growth
of open-source conversational models. However, there remains a scarcity of
comprehensive and in-depth evaluations of these models' performance. In this
study, we examine the influence of training data factors, including quantity,
quality, and linguistic distribution, on model performance. Our analysis is
grounded in several publicly accessible, high-quality instruction datasets, as
well as our own Chinese multi-turn conversations. We assess various models
using an evaluation set of 1,000 samples, encompassing nine real-world
scenarios. Our goal is to supplement manual evaluations with quantitative
analyses, offering valuable insights for the continued advancement of
open-source chat models. Furthermore, to enhance both the performance and the
training and inference efficiency of models in the Chinese domain, we extend
the vocabulary of LLaMA - the open-source model whose performance is closest
to that of proprietary language models like GPT-3 - and conduct secondary
pre-training on 3.4B Chinese words. We make our model, data, and code
publicly available.
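As a concrete illustration of the vocabulary-extension step, here is a minimal sketch that adds Chinese tokens to a LLaMA tokenizer and grows the embedding matrix ahead of secondary pre-training. It assumes the Hugging Face transformers API; the model path and the chinese_tokens.txt token list are hypothetical placeholders, and the paper's actual merge procedure is not specified here.

    # Minimal sketch: extend a LLaMA tokenizer with Chinese tokens before
    # secondary pre-training. Paths and the token list are hypothetical.
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # hypothetical path
    model = LlamaForCausalLM.from_pretrained("path/to/llama")

    # chinese_tokens.txt: one candidate token per line (hypothetical file),
    # e.g. produced by training a Chinese SentencePiece model on raw text.
    with open("chinese_tokens.txt", encoding="utf-8") as f:
        candidates = [line.strip() for line in f if line.strip()]

    # add_tokens skips entries already in the vocabulary and returns the
    # number actually added; the embedding matrix is then resized so the
    # new rows exist and can be learned during continued pre-training.
    added = tokenizer.add_tokens(candidates)
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {added} tokens; vocabulary size is now {len(tokenizer)}")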
Related papers
- Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z)
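For readers who want to run this kind of evaluation themselves, a minimal sketch of calling lm-eval from Python follows. The exact entry point and argument names vary across lm-eval versions, and the checkpoint and task below are placeholder assumptions, not the library's required settings.

    # Minimal sketch of running lm-eval programmatically; API details differ
    # across versions, and the model/tasks below are placeholder assumptions.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                    # Hugging Face backend
        model_args="pretrained=gpt2",  # placeholder checkpoint
        tasks=["hellaswag"],           # placeholder task
        num_fewshot=0,
    )
    print(results["results"])          # per-task metric dictionary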
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
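The confidence calibration that L2CEval measures is commonly summarized with the expected calibration error (ECE). The sketch below computes a standard binned ECE from model confidences and correctness labels; it is an illustrative implementation under that assumption, not L2CEval's exact protocol.

    # Illustrative expected calibration error (ECE), not L2CEval's exact
    # protocol: bin predictions by confidence and compare average
    # confidence with empirical accuracy in each bin.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                ece += mask.mean() * gap  # weight by bin population
        return ece

    # Toy usage with made-up confidences and correctness flags.
    print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))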
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models [6.725922146703912]
This project focuses on enhancing open-source large language models through instruction-tuning.
We explore how various training data factors, such as quantity, quality, and linguistic distribution, influence the performance of instruction-tuned models.
arXiv Detail & Related papers (2023-05-04T17:49:09Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation [0.0]
We introduce modified training objectives for language model finetuning.
We employ massive data augmentation via back-translation to increase the diversity of the training data.
Our model achieves state-of-the-art performance on the MultiWOZ data and shows competitive performance in human evaluation.
arXiv Detail & Related papers (2021-02-09T20:53:34Z)
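As an illustration of back-translation augmentation like AuGPT's, the sketch below paraphrases an English utterance by round-tripping it through German with off-the-shelf MarianMT models. The pivot language and model choice here are assumptions for the example; AuGPT's actual pipeline and language pairs are not specified in this summary.

    # Illustrative back-translation (English -> German -> English) with
    # MarianMT; AuGPT's actual pipeline and language pairs may differ.
    from transformers import MarianMTModel, MarianTokenizer

    def round_trip(text, pivot="de"):
        # Translate with the forward model, then back with the reverse model.
        out = text
        for name in (f"Helsinki-NLP/opus-mt-en-{pivot}",
                     f"Helsinki-NLP/opus-mt-{pivot}-en"):
            tok = MarianTokenizer.from_pretrained(name)
            model = MarianMTModel.from_pretrained(name)
            batch = tok([out], return_tensors="pt", padding=True)
            out = tok.decode(model.generate(**batch)[0],
                             skip_special_tokens=True)
        return out

    # The round trip yields a paraphrase usable as extra training data.
    print(round_trip("I would like to book a table for two tonight."))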
This list is automatically generated from the titles and abstracts of the papers on this site.