Evalverse: Unified and Accessible Library for Large Language Model Evaluation
- URL: http://arxiv.org/abs/2404.00943v2
- Date: Mon, 07 Oct 2024 02:47:36 GMT
- Title: Evalverse: Unified and Accessible Library for Large Language Model Evaluation
- Authors: Jihoo Kim, Wonho Song, Dahyun Kim, Yunsu Kim, Yungi Kim, Chanjun Park
- Abstract summary: We introduce Evalverse, a novel library that streamlines the evaluation of Large Language Models (LLMs).
Evalverse enables individuals with limited knowledge of artificial intelligence to easily request LLM evaluations and receive detailed reports.
We provide a demo video for Evalverse, showcasing its capabilities and implementation in a two-minute format.
- Score: 8.49602675597486
- License:
- Abstract: This paper introduces Evalverse, a novel library that streamlines the evaluation of Large Language Models (LLMs) by unifying disparate evaluation tools into a single, user-friendly framework. Evalverse enables individuals with limited knowledge of artificial intelligence to easily request LLM evaluations and receive detailed reports, facilitated by an integration with communication platforms like Slack. Thus, Evalverse serves as a powerful tool for the comprehensive assessment of LLMs, offering both researchers and practitioners a centralized and easily accessible evaluation framework. Finally, we also provide a demo video for Evalverse, showcasing its capabilities and implementation in a two-minute format.
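As a rough, hypothetical sketch of the workflow the abstract describes, and not Evalverse's actual API, the snippet below shows how a single entry point might parse a chat-style request (for example, a Slack message), dispatch it to a registered benchmark backend, and return a short report. Every name in it (`register`, `handle_request`, `toy_benchmark`) is invented for illustration.

```python
# Hypothetical sketch of a unified evaluation entry point; none of these names
# are Evalverse's real API. It only illustrates the "single, user-friendly
# framework" idea: one request format, many pluggable benchmark backends.
from typing import Callable, Dict

# Registry mapping benchmark names to evaluation backends (illustrative only).
BENCHMARKS: Dict[str, Callable[[str], float]] = {}

def register(name: str):
    """Register an evaluation backend under a benchmark name."""
    def wrap(fn: Callable[[str], float]) -> Callable[[str], float]:
        BENCHMARKS[name] = fn
        return fn
    return wrap

@register("toy_benchmark")
def toy_benchmark(model_name: str) -> float:
    # Placeholder score; a real backend would load the model and run the suite.
    return 0.5

def handle_request(text: str) -> str:
    """Parse a chat-style request such as 'evaluate my-model on toy_benchmark'."""
    _, model, _, benchmark = text.split()
    score = BENCHMARKS[benchmark](model)
    return f"Report: {model} scored {score:.3f} on {benchmark}"

if __name__ == "__main__":
    print(handle_request("evaluate my-model on toy_benchmark"))
```

Running the file prints a one-line report for the toy request; a real system would replace the toy backend with actual benchmark harnesses and route the report back to the requesting chat channel.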
Related papers
- EvalGIM: A Library for Evaluating Generative Image Models [26.631349186382664]
We introduce EvalGIM, a library for evaluating text-to-image generative models.
EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency.
EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models.
arXiv Detail & Related papers (2024-12-13T23:15:35Z)
- Streamlining the review process: AI-generated annotations in research manuscripts [0.5735035463793009]
This study explores the potential of integrating Large Language Models (LLMs) into the peer-review process to enhance efficiency without compromising effectiveness.
We focus on manuscript annotations, particularly excerpt highlights, as a potential area for AI-human collaboration.
This paper introduces AnnotateGPT, a platform that utilizes GPT-4 for manuscript review, aiming to improve reviewers' comprehension and focus.
arXiv Detail & Related papers (2024-11-29T23:26:34Z)
- LLMBox: A Comprehensive Library for Large Language Models [109.15654830320553]
This paper presents LLMBox, a comprehensive and unified library that eases the development, use, and evaluation of large language models (LLMs).
The library offers three main merits: (1) a unified data interface that supports flexible implementation of various training strategies, (2) comprehensive evaluation covering extensive tasks, datasets, and models, and (3) practical considerations, especially user-friendliness and efficiency.
arXiv Detail & Related papers (2024-07-08T02:39:33Z)
- Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions [62.0123588983514]
Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields.
We reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers.
We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources.
arXiv Detail & Related papers (2024-06-09T08:24:17Z)
- UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z)
- FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration of diverse evaluation methodologies and improve their transparency.
The framework also integrates meta-evaluation techniques such as human evaluation and data contamination detection (a generic sketch of the latter appears after this list), which, together with dynamic evaluation modules, enhance the fairness of evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks [76.25209974199274]
Large vision-language models (LVLMs) exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning.
Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.
arXiv Detail & Related papers (2023-10-04T04:07:37Z)
- Improving Language Models via Plug-and-Play Retrieval Feedback [42.786225163763376]
Large language models (LLMs) exhibit remarkable performance across various NLP tasks.
However, they often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios.
We introduce ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback in a plug-and-play framework.
arXiv Detail & Related papers (2023-05-23T12:29:44Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we investigate the use of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose iEvaLM, an interactive evaluation approach that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
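As a final illustration, the following is a generic sketch of n-gram-overlap data contamination detection, the kind of meta-evaluation check mentioned in the FreeEval entry above. It is not FreeEval's implementation; the function names, n-gram size, and threshold are assumptions chosen only to make the idea concrete.

```python
# Generic n-gram-overlap contamination check (illustrative, not FreeEval's code):
# flag benchmark items whose word n-grams overlap heavily with a training corpus.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

def flag_contaminated(items: List[str], corpus_docs: List[str],
                      threshold: float = 0.5) -> List[str]:
    """Return benchmark items whose overlap with the corpus exceeds the threshold."""
    return [it for it in items if contamination_score(it, corpus_docs) >= threshold]
```

In practice, the corpus n-gram set would be built once (or replaced by a hashed index) rather than recomputed for every benchmark item.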