CLongEval: A Chinese Benchmark for Evaluating Long-Context Large
Language Models
- URL: http://arxiv.org/abs/2403.03514v1
- Date: Wed, 6 Mar 2024 07:43:43 GMT
- Authors: Zexuan Qiu, Jingjing Li, Shijue Huang, Wanjun Zhong, Irwin King
- Abstract summary: We present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs.
CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
- Score: 52.092128293192914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing Large Language Models (LLMs) with robust long-context capabilities
has been a recent research focus, resulting in the emergence of long-context
LLMs proficient in Chinese. However, the evaluation of these models remains
underdeveloped due to a lack of benchmarks. To address this gap, we present
CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs.
CLongEval is characterized by three key features: (1) Sufficient data volume,
comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability,
accommodating models with context window sizes from 1K to 100K; (3) High
quality, with over 2,000 manually annotated question-answer pairs in addition
to the automatically constructed labels. With CLongEval, we undertake a
comprehensive assessment of 6 open-source long-context LLMs and 2 leading
commercial counterparts that feature both long-context abilities and
proficiency in Chinese. We also provide in-depth analysis based on the
empirical results, trying to shed light on the critical capabilities that
present challenges in long-context settings. The dataset, evaluation scripts,
and model outputs will be released.
Related papers
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- LooGLE: Can Long-Context Language Models Understand Long Contexts? [50.408957515411096]
LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
arXiv Detail & Related papers (2023-11-08T01:45:37Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)