Related papers: Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction

URL: http://arxiv.org/abs/2407.03651v2
Date: Sun, 14 Jul 2024 22:47:13 GMT
Title: Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction
Authors: Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala
Abstract summary: Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. We propose SWiM, an evaluation framework that addresses the limitations of standard tests. We also propose medoid voting, a simple but effective training-free approach that helps alleviate the lost-in-the-middle effect.
Score: 10.428174043080622
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. An exciting development in this space is models boasting extended context capabilities, with some accommodating over 2 million tokens. Such long-context model capabilities remain uncertain in production systems, motivating the need to benchmark their performance on real-world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long-context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when information is present in the middle of the context window (lost-in-the-middle effect). Next, in addition to our benchmark, we propose medoid voting, a simple but effective training-free approach that helps alleviate this effect by generating responses a few times, each time randomly permuting documents in the context, and selecting the medoid answer. We evaluate medoid voting on single document QA tasks, achieving up to a 24% lift in accuracy. Our code is available at https://github.com/snorkel-ai/long-context-eval.
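
The medoid voting procedure described in the abstract (generate several responses, permute the documents each time, keep the medoid answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how answers are compared, so the word-overlap Jaccard distance here is a stand-in assumption, and `generate`, `medoid_vote`, `num_runs`, and `seed` are hypothetical names introduced for the example.

```python
import random
from typing import Callable, List, Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Distance in [0, 1] based on word-set overlap (an assumed stand-in metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def medoid(
    answers: Sequence[str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> str:
    """Return the answer with the smallest total distance to all answers."""
    if not answers:
        raise ValueError("need at least one answer")
    totals = [sum(distance(a, b) for b in answers) for a in answers]
    # Ties are broken in favor of the earliest answer.
    return answers[totals.index(min(totals))]


def medoid_vote(
    question: str,
    documents: Sequence[str],
    generate: Callable[[str, List[str]], str],  # hypothetical model call
    num_runs: int = 3,
    seed: int = 0,
) -> str:
    """Query the model num_runs times with shuffled documents, return the medoid."""
    rng = random.Random(seed)
    answers = []
    for _ in range(num_runs):
        shuffled = list(documents)
        rng.shuffle(shuffled)  # a new random document order for each run
        answers.append(generate(question, shuffled))
    return medoid(answers)
```

In use, `generate` would wrap whatever model client is at hand and build the prompt from the question and the shuffled documents. Because the medoid is always one of the generated answers, the result is a response the model actually produced rather than a synthesized one. A distance over sentence embeddings could be passed to `medoid` in place of the Jaccard stand-in.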

Related papers

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation [33.22383550511664]
ArenaBencher is a model-agnostic framework for automatic benchmark evolution. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains.
arXiv Detail & Related papers (2025-10-09T17:59:55Z)
KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench [18.149327897427234]
We present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English).
arXiv Detail & Related papers (2025-07-11T11:16:01Z)
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs), termed Contextual Distraction Vulnerability (CDV). This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models [12.035509884945789]
We introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images. We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks. Experiments on four reasoning tasks demonstrate superior collective reasoning abilities of the framework.
arXiv Detail & Related papers (2024-07-16T08:25:26Z)
AutoBencher: Towards Declarative Benchmark Construction [74.54640925146289]
We use AutoBencher to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained knowledge categories, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks.
arXiv Detail & Related papers (2024-07-11T10:03:47Z)
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [25.078498180620425]
We present a step-by-step evaluation framework, Fennec, capable of Fine-grained EvaluatioN Extended through branChing and bridging. We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, leading to an improvement of 1-2 points on the MT-Bench.
arXiv Detail & Related papers (2024-05-20T16:47:22Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
Fast and Accurate Factual Inconsistency Detection Over Long Documents [19.86348214462828]
We introduce SCALE, a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy. This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs. We have released our code and data publicly to GitHub.
arXiv Detail & Related papers (2023-10-19T22:55:39Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information helps model performance when it is relevant. Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation [20.18656308749408]
Large language models (LLMs) have been used for generation and can now output human-like text. This paper investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
arXiv Detail & Related papers (2023-01-27T22:02:27Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks. In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.