F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
- URL: http://arxiv.org/abs/2401.14869v1
- Date: Fri, 26 Jan 2024 13:55:32 GMT
- Title: F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
- Authors: Yu Sun, Keyu Chen, Shujie Wang, Qipeng Guo, Hang Yan, Xipeng Qiu,
Xuanjing Huang, Dahua Lin
- Abstract summary: We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
- Score: 111.46455901113976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) garner significant attention for their
unprecedented performance, leading to a growing body of research on evaluating LLMs. However, these evaluation benchmarks are limited to assessing
the instruction-following capabilities, overlooking the fundamental abilities
that emerge during the pre-training stage. Previous subjective evaluation
methods mainly rely on scoring by API models. However, in the absence of
references, large models have shown limited ability to discern subtle
differences. To bridge the gap, we propose F-Eval, a bilingual evaluation
benchmark to evaluate the fundamental abilities, including expression,
commonsense and logic. The tasks in F-Eval include multi-choice objective
tasks, open-ended objective tasks, reference-based subjective tasks and
reference-free subjective tasks. For reference-free subjective tasks, we devise
new evaluation methods, serving as alternatives to scoring by API models. We
conduct evaluations on 13 advanced LLMs. Results show that our evaluation
methods achieve higher correlation coefficients and greater distinction than other
evaluators. Additionally, we discuss the influence of different model sizes,
dimensions, and normalization methods. We anticipate that F-Eval will
facilitate the study of LLMs' fundamental abilities.
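
The abstract reports correlation coefficients against other evaluators and discusses normalization methods, but does not spell out the computation here. The snippet below is a minimal sketch under assumed choices (Spearman correlation between automatic and human scores, z-score normalization across subtasks); the score values and subtask names are illustrative placeholders, not the paper's actual pipeline.

```python
# Minimal sketch (not the paper's pipeline): correlate automatic scores with
# human judgments, then normalize per-subtask scores before aggregation.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-sample scores for one subtask.
auto_scores = np.array([0.62, 0.81, 0.45, 0.90, 0.73])
human_scores = np.array([3, 4, 2, 5, 4])  # e.g., 1-5 human ratings

# Correlation between the automatic evaluator and human annotators.
rho, p_value = spearmanr(auto_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# One possible normalization: z-score each subtask so that scores measured
# on different scales can be averaged into a single benchmark score.
def z_normalize(scores: np.ndarray) -> np.ndarray:
    std = scores.std()
    return (scores - scores.mean()) / std if std > 0 else np.zeros_like(scores)

# Hypothetical per-model scores on three subtasks with different scales.
subtask_scores = {
    "expression": np.array([0.62, 0.81, 0.45]),
    "commonsense": np.array([55.0, 70.0, 40.0]),
    "logic": np.array([1.2, 2.4, 0.8]),
}
normalized = {name: z_normalize(s) for name, s in subtask_scores.items()}
overall = np.mean(list(normalized.values()), axis=0)
print("Per-model overall scores:", overall.round(3))
```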
Related papers
- Enhancing LLM Evaluations: The Garbling Trick [0.0]
As large language models (LLMs) become increasingly powerful, it becomes challenging to distinguish between models based on their performance.
We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks.
Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI's o1-preview and Google's gemini-pro-1.5.
arXiv Detail & Related papers (2024-11-03T11:39:50Z)
- From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management [6.70908766695241]
This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations.
Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability.
Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation.
arXiv Detail & Related papers (2024-08-09T20:35:10Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Evaluation Gaps in Machine Learning Practice [13.963766987258161]
In practice, evaluations of machine learning models frequently focus on a narrow range of decontextualized predictive behaviours.
We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations.
By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts.
arXiv Detail & Related papers (2022-05-11T04:00:44Z)