ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
- URL: http://arxiv.org/abs/2403.20262v2
- Date: Mon, 22 Jul 2024 17:24:14 GMT
- Title: ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
- Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier
- Abstract summary: We propose a new benchmark for long-context models based on a practical meeting assistant scenario.
Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers.
Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
- Score: 25.74741863885925
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
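Since the abstract centers on a GPT-4-based evaluation of generated answers against ground-truth answers, a minimal LLM-as-judge sketch may help readers picture the setup. The prompt wording, the 1-5 scale, the model name, and the function below are illustrative assumptions, not the exact protocol described in the paper.
```python
# Hedged sketch of an LLM-as-judge scoring step (assumed prompt, scale, and model;
# not the exact evaluation procedure used in ELITR-Bench).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, reference: str, candidate: str) -> int:
    """Ask a judge model to rate a candidate answer against a ground-truth answer."""
    prompt = (
        "You are grading an answer produced by a meeting assistant.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 (wrong) to 5 (fully correct). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # judge model; any sufficiently capable model could be substituted
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```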
Related papers
- IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark [22.238377215355545]
We introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format.
We observe a significant performance gap between the state-of-the-art sub-10B open models vs. closed ones.
The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-11-12T01:05:55Z) - PersoBench: Benchmarking Personalized Response Generation in Large Language Models [6.8046587254152735]
We present a new benchmark, PersoBench, to evaluate the personalization ability of large language models (LLMs) in persona-aware dialogue generation.
Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization.
arXiv Detail & Related papers (2024-10-04T07:29:41Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method, ProbDiff, for assessing the efficacy of various Large Language Models (LLMs).
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions.
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
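To illustrate the probability-discrepancy idea sketched above, the snippet below compares the log-likelihood a model assigns to its initial response with that of a revised response. The placeholder model, the way the revision is obtained, and the scoring rule are assumptions for illustration, not the exact ProbDiff algorithm.
```python
# Hedged sketch of a probability-discrepancy self-evaluation signal
# (assumed model and scoring rule; not the exact ProbDiff procedure).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def response_log_likelihood(prompt: str, response: str) -> float:
    """Sum of token log-probabilities the model assigns to `response` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token, conditioned on all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Approximation: assumes tokenizing prompt+response keeps the prompt tokens unchanged.
    response_start = prompt_ids.shape[1] - 1  # index of the first predicted response token
    return token_log_probs[:, response_start:].sum().item()

def probability_discrepancy(prompt: str, initial: str, revised: str) -> float:
    """Positive values mean the model prefers the revised answer over its initial one."""
    return response_log_likelihood(prompt, revised) - response_log_likelihood(prompt, initial)
```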
arXiv Detail & Related papers (2024-05-17T03:50:28Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models [14.906150451947443]
Counting-Stars is a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs.
We conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1.
Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks.
arXiv Detail & Related papers (2024-03-18T14:01:45Z) - Evaluating Very Long-Term Conversational Memory of LLM Agents [95.84027826745609]
We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
arXiv Detail & Related papers (2024-02-27T18:42:31Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal the extent of a language model's grasp of language, as reflected in its proficiency at understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
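As a rough illustration of the kind of check behind that claim, the snippet below correlates a simple unigram-overlap F1 with human scores; the metric, the toy data, and the use of Spearman correlation are assumptions for illustration, not L-Eval's actual pipeline.
```python
# Hedged sketch: correlating a simple unigram-overlap F1 with human scores
# (illustrative metric and toy data; not L-Eval's exact evaluation pipeline).
from collections import Counter
from scipy.stats import spearmanr

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1 overlap between a candidate and a reference string."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Each toy example pairs a model answer, a reference answer, and a hypothetical human score.
examples = [
    ("the meeting is on friday", "the meeting takes place on friday", 5),
    ("they agreed to extend the deadline", "the deadline was extended by one week", 4),
    ("they discussed the budget", "the budget was not discussed", 1),
]
metric_scores = [unigram_f1(candidate, reference) for candidate, reference, _ in examples]
human_scores = [score for _, _, score in examples]
correlation, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human judgment: {correlation:.2f}")
```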
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)