Related papers: Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

URL: http://arxiv.org/abs/2510.02629v2
Date: Wed, 22 Oct 2025 16:22:21 GMT
Title: Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models
Authors: Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein
Abstract summary: Context utilisation is the ability of Language Models to incorporate relevant information from the provided context when generating responses. We introduce the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage. We find that MechLight performs best across all context scenarios.
Score: 36.64390220306208
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.
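
The idea of scoring a highlight explanation against known ground-truth context usage can be illustrated with a small sketch. This is not the paper's metric, which the abstract does not specify; the function name `top_k_precision`, the attribution scores, and the gold mask below are all hypothetical and chosen only to show the general shape of such a check.

```python
from typing import Sequence


def top_k_precision(scores: Sequence[float], gold: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-scored context tokens that lie in the gold span.

    Hypothetical illustration only: `scores` stands in for a highlight
    explanation's per-token attributions, and `gold` marks the tokens the
    model is known to have used in a controlled test case.
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank token indices by attribution score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(gold[i] for i in ranked[:k]) / k


# Made-up attributions over five context tokens; tokens 1 and 2 are the
# ground-truth evidence in this toy case.
scores = [0.05, 0.40, 0.35, 0.10, 0.10]
gold = [False, True, True, False, False]
print(top_k_precision(scores, gold, k=2))
```

In this toy case both of the two highest-scored tokens fall inside the gold span. A real evaluation would aggregate such a score over many controlled cases and context scenarios.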

Related papers

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
DICE: A Framework for Dimensional and Contextual Evaluation of Language Models [1.534667887016089]
Language models (LMs) are increasingly being integrated into a wide range of applications. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. We propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions.
arXiv Detail & Related papers (2025-04-14T16:08:13Z)
On the Loss of Context-awareness in General Instruction Fine-tuning [101.03941308894191]
We investigate the loss of context awareness after supervised fine-tuning. We find that the performance decline is associated with a bias toward different roles learned during conversational instruction fine-tuning. We propose a metric to identify context-dependent examples from general instruction fine-tuning datasets.
arXiv Detail & Related papers (2024-11-05T00:16:01Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters. We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
On Measuring Context Utilization in Document-Level MT Systems [12.02023514105999]
We propose to complement accuracy-based evaluation with measures of context utilization. We show that automatically-annotated supporting context gives similar conclusions to human-annotated context.
arXiv Detail & Related papers (2024-02-02T13:37:07Z)
Can Large Language Models Understand Context? [17.196362853457412]
This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models. As LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings.
arXiv Detail & Related papers (2024-02-01T18:55:29Z)
Quantifying the Plausibility of Context Reliance in Neural Machine Translation [25.29330352252055]
We introduce Plausibility Evaluation of Context Reliance (PECoRe). PECoRe is an end-to-end interpretability framework designed to quantify context usage in language models' generations. We use PECoRe to quantify the plausibility of context-aware machine translation models.
arXiv Detail & Related papers (2023-10-02T13:26:43Z)
Measuring and Increasing Context Usage in Context-Aware Machine Translation [64.5726087590283]
We introduce a new metric, conditional cross-mutual information, to quantify the usage of context by machine translation models. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models.
arXiv Detail & Related papers (2021-05-07T19:55:35Z)
How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it. We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
