LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- URL: http://arxiv.org/abs/2308.04945v2
- Date: Mon, 26 Feb 2024 13:33:43 GMT
- Title: LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- Authors: Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir
Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy
Mubarak, Ahmed Ali, Majd Hawasly, Nadir Durrani, Firoj Alam
- Abstract summary: We introduce the LLMeBench framework, which can be seamlessly customized to evaluate Large Language Models (LLMs) for any NLP task, regardless of language.
A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks.
The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points.
- Score: 26.413008616554816
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The recent development and success of Large Language Models (LLMs)
necessitate an evaluation of their performance across diverse NLP tasks in
different languages. Although several frameworks have been developed and made
publicly available, their customization capabilities for specific tasks and
datasets are often complex for different users. In this study, we introduce the
LLMeBench framework, which can be seamlessly customized to evaluate LLMs for
any NLP task, regardless of language. The framework features generic dataset
loaders, several model providers, and pre-implements most standard evaluation
metrics. It supports in-context learning with zero- and few-shot settings. A
specific dataset and task can be evaluated for a given LLM in less than 20
lines of code while allowing full flexibility to extend the framework for
custom datasets, models, or tasks. The framework has been tested on 31 unique
NLP tasks using 53 publicly available datasets within 90 experimental setups,
involving approximately 296K data points. We open-sourced LLMeBench for the
community (https://github.com/qcri/LLMeBench/), and a video demonstrating the
framework is available online (https://youtu.be/9cC2m_abk3A).
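To make the under-20-lines claim concrete, the sketch below shows the typical shape of an LLMeBench benchmark asset: a config() that binds a dataset loader, a task (with its evaluation metric), and a model provider; a prompt() that builds the zero- or few-shot prompt; and a post_process() that maps the raw model output to a prediction. This is a minimal sketch assuming an OpenAI-style chat-completion response; the specific class names (ArSASDataset, SentimentTask, OpenAIModel) are illustrative and may differ from the exact imports in the repository.

```python
# Illustrative LLMeBench-style benchmark asset (a sketch, not the official
# example; class names and the response format are assumptions).
from llmebench.datasets import ArSASDataset
from llmebench.models import OpenAIModel
from llmebench.tasks import SentimentTask


def config():
    # Bind the dataset loader, task definition, and model provider; the
    # framework handles data loading, caching, and metric computation.
    return {
        "dataset": ArSASDataset,
        "task": SentimentTask,
        "model": OpenAIModel,
        "model_args": {"max_tries": 3},
    }


def prompt(input_sample):
    # Zero-shot prompt; a few-shot asset would prepend labeled examples here.
    return [
        {
            "role": "user",
            "content": "Classify the sentiment of the following sentence as "
                       f"Positive, Negative, or Neutral.\nSentence: {input_sample}",
        }
    ]


def post_process(response):
    # Map the raw provider response to the label expected by the metric
    # (assumes an OpenAI-style chat completion payload).
    return response["choices"][0]["message"]["content"].strip()
```

Swapping in a custom dataset, task, or model then reduces to replacing the corresponding entry in config(), which is what keeps per-experiment code this short.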
Related papers
- ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z)
- PyBench: Evaluating LLM Agent on various real-world coding tasks [13.347173063163138]
PyBench is a benchmark covering five main categories of real-world tasks and more than 10 types of files.
Our evaluations indicate that current open-source LLMs are struggling with these tasks.
Our fine-tuned 8B model, PyLlama3, achieves strong performance on PyBench.
arXiv Detail & Related papers (2024-07-23T15:23:14Z)
- NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? [37.64593022203498]
NeedleBench is a framework consisting of progressively more challenging tasks for assessing bilingual long-context capabilities.
We use the framework to assess how well the leading open-source models can identify key information relevant to the question.
We propose the Ancestral Trace Challenge to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- LongIns: A Challenging Long-context Instruction-based Exam for LLMs [44.51209510772957]
Long-context capabilities of large language models (LLMs) have been a hot topic in recent years.
We propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs.
arXiv Detail & Related papers (2024-06-25T14:31:26Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, which automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performances in various applications, yet evaluating them is still challenging.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks.
We propose ERBench, which uses the integrity constraints of relational databases to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to complete multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System, which evaluates whether LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
- Learning to Retrieve In-Context Examples for Large Language Models [69.9707552694766]
Large language models (LLMs) have demonstrated their ability to learn in-context.
The effectiveness of in-context learning is heavily reliant on the quality of the selected examples.
We propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples.
arXiv Detail & Related papers (2023-07-14T05:23:08Z)