FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
- URL: http://arxiv.org/abs/2404.06003v1
- Date: Tue, 9 Apr 2024 04:17:51 GMT
- Title: FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
- Authors: Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang
- Abstract summary: FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
- Score: 36.273451767886726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, and evaluation efficiency is commonly overlooked despite the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluations that demand sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with the platform's dynamic evaluation modules, enhance the fairness of evaluation outcomes. Lastly, FreeEval is designed with a high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for both open-source and proprietary LLMs.
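The abstract above describes an architecture rather than an API, so the following is only a minimal sketch of the ideas it names: evaluation steps expressed through a small unified abstraction, driven by a config, with a response cache to avoid repeating costly LLM inference. The step types, config keys, `DiskCache` class, and the stand-in `generate` callable are all assumptions introduced for illustration and are not FreeEval's actual interfaces.

```python
"""Sketch of a config-driven evaluation pipeline with response caching.

Hypothetical illustration only: step names, config keys, and the cache
layout are assumed, not taken from FreeEval.
"""
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


class DiskCache:
    """Caches model responses keyed by (model name, prompt) so that
    re-running an evaluation does not repeat expensive LLM inference."""

    def __init__(self, cache_dir: str = ".eval_cache") -> None:
        self.dir = Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def _key(self, model: str, prompt: str) -> Path:
        digest = hashlib.sha256(f"{model}::{prompt}".encode()).hexdigest()
        return self.dir / f"{digest}.json"

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        path = self._key(model, prompt)
        if path.exists():
            return json.loads(path.read_text())["response"]
        response = compute(prompt)
        path.write_text(json.dumps({"model": model, "response": response}))
        return response


def run_pipeline(config: Dict, generate: Callable[[str], str]) -> List[Dict]:
    """Runs the steps declared in `config` in order. Each step is a plain
    dict, which keeps the pipeline transparent and easy to extend with new
    evaluation or meta-evaluation steps (e.g. contamination checks)."""
    cache = DiskCache()
    records = [{"prompt": p} for p in config["prompts"]]
    for step in config["steps"]:
        if step["type"] == "inference":
            for rec in records:
                rec["response"] = cache.get_or_compute(
                    config["model"], rec["prompt"], generate)
        elif step["type"] == "exact_match":
            for rec, ref in zip(records, config["references"]):
                rec["correct"] = rec["response"].strip() == ref.strip()
    return records


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real LLM call (API or local multi-GPU inference).
        return "4" if "2 + 2" in prompt else "unknown"

    cfg = {
        "model": "demo-model",
        "prompts": ["What is 2 + 2? Answer with a number only."],
        "references": ["4"],
        "steps": [{"type": "inference"}, {"type": "exact_match"}],
    }
    print(run_pipeline(cfg, fake_llm))
```

In a real setup the `generate` callable would wrap API calls or distributed multi-node, multi-GPU inference, and further step types (LLM-as-a-judge, contamination detection, human meta-evaluation) would be registered alongside the two shown here.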
Related papers
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance.
Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks.
Our method also exhibits an improvement of up to 12% in LC Win-Rate on the AlpacaEval2 leaderboard when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z) - FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom [19.104850413126066]
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs).
Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers.
We propose FedEval-LLM, which provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets or external tools.
arXiv Detail & Related papers (2024-04-18T15:46:26Z) - UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z) - MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation [22.19073789961769]
Progress in generative Large Language Models (LLMs) has been remarkable; however, the quality of the text these models generate often reveals persistent issues.
We propose MATEval, a "Multi-Agent Text Evaluation" framework.
Our framework incorporates self-reflection and Chain-of-Thought strategies, along with feedback mechanisms, to enhance the depth and breadth of the evaluation process.
arXiv Detail & Related papers (2024-03-28T10:41:47Z) - CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to achieve dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models [28.441725610692714]
We propose a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs).
We design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call (a minimal sketch of this pattern follows this list).
We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods.
arXiv Detail & Related papers (2023-05-23T05:57:09Z)
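As referenced in the LLM-Eval entry above, the single-call, multi-dimensional scoring pattern can be sketched as follows. The dimension names, the 0-5 score range, the prompt wording, and the `judge` stand-in are illustrative assumptions, not the paper's exact schema or prompts.

```python
"""Sketch of single-call, multi-dimensional LLM-based scoring.

Illustrative assumptions only: dimensions, score range, and prompt text
are not taken from the LLM-Eval paper.
"""
import json
from typing import Callable, Dict

DIMENSIONS = ["content", "grammar", "relevance", "appropriateness"]

PROMPT_TEMPLATE = (
    "Score the response to the dialogue below on each dimension from 0 to 5.\n"
    "Reply with a single JSON object whose keys are: {dims}.\n\n"
    "Dialogue context:\n{context}\n\nResponse:\n{response}\n"
)


def evaluate_turn(context: str, response: str,
                  judge: Callable[[str], str]) -> Dict[str, int]:
    """One judge call returns scores for every dimension at once,
    instead of one call per dimension."""
    prompt = PROMPT_TEMPLATE.format(
        dims=", ".join(DIMENSIONS), context=context, response=response)
    raw = judge(prompt)        # e.g. an API or local model call
    scores = json.loads(raw)   # unified schema: one JSON object
    return {d: int(scores[d]) for d in DIMENSIONS}


if __name__ == "__main__":
    def fake_judge(prompt: str) -> str:
        # Stand-in judge that always returns mid-range scores.
        return json.dumps({d: 3 for d in DIMENSIONS})

    print(evaluate_turn("User: How do I reset my password?",
                        "Click 'Forgot password' on the login page.",
                        fake_judge))
```

The design point is that one judge call returns a structured object covering every dimension rather than one call per dimension, which is where the efficiency of such schemes comes from.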