L-Eval: Instituting Standardized Evaluation for Long Context Language
Models
- URL: http://arxiv.org/abs/2307.11088v3
- Date: Wed, 4 Oct 2023 10:04:25 GMT
- Title: L-Eval: Instituting Standardized Evaluation for Long Context Language
Models
- Authors: Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun
Zhang, Lingpeng Kong and Xipeng Qiu
- Abstract summary: We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
- Score: 91.05820785008527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been growing interest in extending the context length of
large language models (LLMs), aiming to effectively process long inputs of one
turn or conversations with more extensive histories. While proprietary models
such as GPT-4 and Claude can largely preserve the reasoning ability in an
extended context, open-source models are still progressing through the early
stages of development. To bridge this gap, we propose L-Eval to institute a
more standardized evaluation for long context language models (LCLMs),
addressing two key aspects: dataset construction and evaluation metrics. On the
one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long
documents, and over 2,000 human-labeled query-response pairs encompassing
diverse question styles, domains, and input lengths (3k$\sim$200k tokens). On
the other hand, we investigate the effectiveness of evaluation metrics for
LCLMs. Results show that popular n-gram matching metrics generally cannot
correlate well with human judgment, and thus we strongly advocate for
length-instruction-enhanced (LIE) evaluation and employing LLM judges. We
conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source
counterparts using the L-Eval benchmark. Our empirical findings offer useful
insights into the study of LCLMs and lay the groundwork for the development of
more principled evaluation of these models.
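The abstract contrasts n-gram matching metrics with length-instruction-enhanced (LIE) evaluation and LLM judges. Below is a minimal sketch of that contrast; it is not the official L-Eval harness, and the prompt wordings, the grading rubric, and the helper names (`unigram_f1`, `build_lie_prompt`, `judge_prompt`) are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch only: contrasts an n-gram metric with a length-instruction-enhanced
# (LIE) prompt plus an LLM-judge rubric, as discussed in the L-Eval abstract.
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1, the kind of n-gram matching metric the paper finds
    correlates poorly with human judgment on long-form answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def build_lie_prompt(document: str, question: str, target_len: int) -> str:
    """Length-instruction-enhanced prompt: an explicit target length is
    appended so that verbosity differences between models do not dominate
    the metric. The exact wording here is an assumption."""
    return (
        f"{document}\n\n"
        f"Question: {question}\n"
        f"Answer the question in about {target_len} words."
    )


def judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Single-answer grading prompt for an LLM judge; the 1-5 rubric is a
    simplified illustration, not the paper's rubric."""
    return (
        "You are grading an answer to a question about a long document.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 1 (wrong) to 5 (fully correct). "
        "Reply with the number only."
    )


if __name__ == "__main__":
    ref = "The contract is terminated if payment is late by more than 30 days."
    pred = ("Termination occurs when the payment deadline is missed by over "
            "thirty days, as stated in the agreement.")
    # The answer is essentially correct, yet lexical overlap is low:
    print(f"unigram F1 = {unigram_f1(pred, ref):.2f}")
    # An LLM judge would instead be queried with judge_prompt(...); the call
    # itself is omitted because it depends on whichever model API is used.
```

The toy example shows why lexical overlap can under-score a correct paraphrase, which is the failure mode that motivates the paper's advocacy for LIE evaluation and LLM judges.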
Related papers
- Large Language Models Can Self-Improve in Long-context Reasoning [100.52886241070907]
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning.
We propose an approach specifically designed for this purpose.
It achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
arXiv Detail & Related papers (2024-11-12T19:53:00Z)
- LongIns: A Challenging Long-context Instruction-based Exam for LLMs [44.51209510772957]
Long-context capabilities of large language models (LLMs) have been a hot topic in recent years.
We propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs.
arXiv Detail & Related papers (2024-06-25T14:31:26Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models [25.74741863885925]
We propose a new benchmark for long-context models based on a practical meeting assistant scenario.
Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers.
Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
arXiv Detail & Related papers (2024-03-29T16:13:31Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- LooGLE: Can Long-Context Language Models Understand Long Contexts? [46.143956498529796]
LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)