INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large
Language Models
- URL: http://arxiv.org/abs/2306.04757v3
- Date: Thu, 15 Jun 2023 05:08:56 GMT
- Title: INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large
Language Models
- Authors: Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria
- Abstract summary: INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Instruction-tuned large language models have revolutionized natural language
processing and have shown great potential in applications such as
conversational agents. These models, such as GPT-4, can not only master
language but also solve complex tasks in areas like mathematics, coding,
medicine, and law. Despite their impressive capabilities, there is still a lack
of comprehensive understanding regarding their full potential, primarily due to
the black-box nature of many models and the absence of holistic evaluation
studies. To address these challenges, we present INSTRUCTEVAL, a more
comprehensive evaluation suite designed specifically for instruction-tuned
large language models. Unlike previous works, our evaluation involves a
rigorous assessment of models based on problem-solving, writing ability, and
alignment to human values. We take a holistic approach to analyze various
factors affecting model performance, including the pretraining foundation,
instruction-tuning data, and training methods. Our findings reveal that the
quality of instruction data is the most crucial factor in scaling model
performance. While open-source models demonstrate impressive writing abilities,
there is substantial room for improvement in problem-solving and alignment. We
are encouraged by the rapid development of models by the open-source community,
but we also highlight the need for rigorous evaluation to support claims made
about these models. Through INSTRUCTEVAL, we aim to foster a deeper
understanding of instruction-tuned models and advancements in their
capabilities. INSTRUCTEVAL is publicly available at
https://github.com/declare-lab/instruct-eval.
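The abstract describes three evaluation axes: problem-solving, writing ability, and alignment to human values. The sketch below is a minimal, hypothetical illustration of how per-benchmark scores could be grouped and averaged along those three axes; it is not the actual INSTRUCTEVAL code or command-line interface, and the benchmark groupings, function names, and numbers here are assumptions made only for the example.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical grouping that mirrors the three axes named in the abstract
# (problem-solving, writing, alignment); not the actual INSTRUCTEVAL API.
AXES = {
    "problem_solving": ["mmlu", "bbh", "drop", "humaneval"],
    "writing": ["informative", "professional", "argumentative", "creative"],
    "alignment": ["helpfulness", "honesty", "harmlessness"],
}


@dataclass
class AxisReport:
    axis: str
    per_task: dict[str, float]  # task name -> score

    @property
    def average(self) -> float:
        # Average of the tasks actually evaluated for this axis.
        return mean(self.per_task.values())


def aggregate(scores: dict[str, float]) -> list[AxisReport]:
    """Group raw per-task scores into the three axes and average each axis."""
    reports = []
    for axis, tasks in AXES.items():
        per_task = {t: scores[t] for t in tasks if t in scores}
        if per_task:  # skip axes with no evaluated tasks
            reports.append(AxisReport(axis, per_task))
    return reports


if __name__ == "__main__":
    # Placeholder numbers, not real results.
    raw = {"mmlu": 47.2, "bbh": 38.5, "humaneval": 15.2,
           "informative": 4.2, "creative": 4.5, "honesty": 3.9}
    for report in aggregate(raw):
        print(f"{report.axis:>15}: {report.average:.2f} "
              f"({', '.join(report.per_task)})")
```

Reporting one averaged number per axis, rather than a single overall score, keeps the problem-solving, writing, and alignment dimensions separately visible, which matches the holistic framing of the abstract.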
Related papers
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Can I understand what I create? Self-Knowledge Evaluation of Large Language Models [31.85129258347539]
Large language models (LLMs) have achieved remarkable progress in linguistic tasks.
Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework.
arXiv Detail & Related papers (2024-06-10T09:53:54Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Large Language Models Are Also Good Prototypical Commonsense Reasoners [11.108562540123387]
Traditional fine-tuning approaches can be resource-intensive and potentially compromise a model's generalization capacity.
We draw inspiration from the outputs of large models for tailored tasks and semi-automatically develop a set of novel prompts.
With better-designed prompts, we achieve a new state of the art (SOTA) on the ProtoQA leaderboard.
arXiv Detail & Related papers (2023-09-22T20:07:24Z)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z)
- Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the model with the closest open-source performance to proprietary language models like GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding [1.827510863075184]
Curriculum is a new format of NLI benchmark for the evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
arXiv Detail & Related papers (2022-04-13T10:32:03Z)