ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
- URL: http://arxiv.org/abs/2403.20262v2
- Date: Mon, 22 Jul 2024 17:24:14 GMT
- Title: ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
- Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier,
- Abstract summary: We propose a new benchmark for long-context models based on a practical meeting assistant scenario.
Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers.
Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
- Score: 25.74741863885925
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers. Our experiments with recent long-context LLMs on ELITR-Bench highlight a gap between open-source and proprietary models, especially when questions are asked sequentially within a conversation. We also provide a thorough analysis of our GPT-4-based evaluation method, encompassing insights from a crowdsourcing study. Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
Related papers
- IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark [22.238377215355545]
We introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format.
We observe a significant performance gap between the state-of-the-art sub-10B open models vs. closed ones.
The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-11-12T01:05:55Z) - ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage [21.036912648701264]
We introduce a new metric called information coverage (IC) which quantifies the proportion of the input context necessary for answering queries.
We present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context.
arXiv Detail & Related papers (2024-10-22T09:35:42Z) - DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA)
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - Exploring Large Language Models for Relevance Judgments in Tetun [0.03683202928838613]
This paper explores the feasibility of using large language models (LLMs) to automate relevance assessments.
LLMs are employed to automate relevance judgment tasks, by providing a series of query-document pairs in Tetun as the input text.
Our investigation reveals results that align closely with those reported in studies of high-resource languages.
arXiv Detail & Related papers (2024-06-11T14:28:24Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs)
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - Evaluating Very Long-Term Conversational Memory of LLM Agents [95.84027826745609]
We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
arXiv Detail & Related papers (2024-02-27T18:42:31Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs)
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.