StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
- URL: http://arxiv.org/abs/2408.03281v2
- Date: Wed, 7 Aug 2024 01:00:55 GMT
- Title: StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
- Authors: Boxi Cao, Mengjie Ren, Hongyu Lin, Xianpei Han, Feng Zhang, Junfeng Zhan, Le Sun
- Abstract summary: We propose a novel evaluation framework referred to as StructEval.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts.
Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination.
- Score: 46.59416831869014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
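To make the idea concrete, below is a minimal, hypothetical sketch of how a single atomic test objective could be expanded into a structured set of test instances spanning several cognitive levels and critical concepts, and then scored at the objective level rather than per item. The class names, the Bloom-style level list, and the exact-match scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: one atomic test objective expanded into many related
# instances and scored per cognitive level, rather than by a single question.
# The level names, dataclasses, and aggregation rule are assumptions for
# illustration and do not reproduce the StructEval implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

COGNITIVE_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

@dataclass
class TestInstance:
    question: str
    answer: str
    level: str    # which cognitive level this instance probes
    concept: str  # which critical concept it covers

@dataclass
class StructuredObjective:
    """One atomic test objective, deepened across levels and broadened across concepts."""
    objective: str
    instances: List[TestInstance] = field(default_factory=list)

    def score(self, model: Callable[[str], str]) -> Dict[str, float]:
        """Return per-level accuracy; a capability claim rests on consistency
        across levels and concepts, not on a single memorizable item."""
        per_level: Dict[str, List[bool]] = {lvl: [] for lvl in COGNITIVE_LEVELS}
        for inst in self.instances:
            correct = model(inst.question).strip().lower() == inst.answer.strip().lower()
            per_level[inst.level].append(correct)
        return {lvl: sum(res) / len(res) for lvl, res in per_level.items() if res}

# Usage with a toy lookup table standing in for an LLM call:
if __name__ == "__main__":
    obj = StructuredObjective(
        objective="photosynthesis",
        instances=[
            TestInstance("What gas do plants absorb for photosynthesis?", "carbon dioxide",
                         level="remember", concept="inputs"),
            TestInstance("Why do leaves appear green?", "chlorophyll",
                         level="understand", concept="pigments"),
        ],
    )
    toy_answers = {
        "What gas do plants absorb for photosynthesis?": "carbon dioxide",
        "Why do leaves appear green?": "chlorophyll",
    }
    print(obj.score(lambda q: toy_answers.get(q, "")))
```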
Related papers
- Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries [54.325172923155414]
We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.
This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts.
arXiv Detail & Related papers (2024-09-19T10:38:01Z) - Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount.
Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks.
This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z) - CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Establishing Trustworthiness: Rethinking Tasks and Model Evaluation [36.329415036660535]
We argue that it is time to rethink what constitutes tasks and model evaluation in NLP.
We review existing compartmentalized approaches for understanding the origins of a model's functional capacity.
arXiv Detail & Related papers (2023-10-09T06:32:10Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - A Comprehensive Evaluation Framework for Deep Model Robustness [44.20580847861682]
Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications.
However, they are vulnerable to adversarial examples, which motivates research on adversarial defenses.
This paper presents a model evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics.
arXiv Detail & Related papers (2021-01-24T01:04:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.