DICE: A Framework for Dimensional and Contextual Evaluation of Language Models
- URL: http://arxiv.org/abs/2504.10359v1
- Date: Mon, 14 Apr 2025 16:08:13 GMT
- Title: DICE: A Framework for Dimensional and Contextual Evaluation of Language Models
- Authors: Aryan Shrivastava, Paula Akemi Aoyagui,
- Abstract summary: Language models (LMs) are increasingly being integrated into a wide range of applications. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. We propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions.
- Score: 1.534667887016089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.
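To make the proposal concrete, below is a minimal, hypothetical sketch of how DICE-style context-agnostic and context-specific parameters could share a single scoring interface. The class names, fields, and placeholder scorers are illustrative assumptions and are not drawn from the paper.

```python
# Illustrative sketch only; DICE does not prescribe this data model.
# All names (EvalParameter, DiceEvaluation, the example scorers) are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalParameter:
    """One evaluation dimension, scored in [0, 1]."""
    name: str
    context_specific: bool               # False: meaningful in any deployment context
    scorer: Callable[[str, str], float]  # (prompt, model_output) -> score


@dataclass
class DiceEvaluation:
    """Bundles the dimensions a stakeholder cares about for one deployment."""
    parameters: List[EvalParameter] = field(default_factory=list)

    def evaluate(self, prompt: str, output: str) -> Dict[str, float]:
        # Score each dimension independently; weighting and aggregation are left to stakeholders.
        return {p.name: p.scorer(prompt, output) for p in self.parameters}


# Context-agnostic dimensions named in the abstract; the lambda scorers are placeholders
# that would be replaced by rubrics, classifiers, or LLM judges in practice.
context_agnostic = [
    EvalParameter("robustness", False, lambda prompt, output: 0.0),
    EvalParameter("coherence", False, lambda prompt, output: 0.0),
    EvalParameter("epistemic_honesty", False, lambda prompt, output: 0.0),
]

# A context-specific dimension a stakeholder might add for, say, a clinical deployment.
context_specific = [
    EvalParameter("clinical_guideline_adherence", True, lambda prompt, output: 0.0),
]

suite = DiceEvaluation(parameters=context_agnostic + context_specific)
print(suite.evaluate("What dose should I take?", "Please consult your physician."))
```

The only point of the sketch is the separation of concerns: context-agnostic dimensions can ship with a generic harness, while context-specific ones are defined and weighted by the stakeholders deploying the model in a particular setting.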
Related papers
- Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games.
This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities.
Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
arXiv Detail & Related papers (2025-04-13T10:46:13Z) - Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations.
Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation.
We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models [0.0]
The proliferation of large language models (LLMs) requires robust evaluation of their alignment with local values and ethical standards.
LocalValueBench is a benchmark designed to assess LLMs' adherence to Australian values.
arXiv Detail & Related papers (2024-07-27T05:55:42Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are largely limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria using a potent MLLM as the judge (see the sketch after this list).
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z) - Evaluation Gaps in Machine Learning Practice [13.963766987258161]
In practice, evaluations of machine learning models frequently focus on a narrow range of decontextualized predictive behaviours.
We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations.
By studying these properties, we show that the machine learning discipline implicitly assumes a range of commitments that have normative impacts.
arXiv Detail & Related papers (2022-05-11T04:00:44Z) - Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
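Several of the papers above (e.g., the BiGGen Bench and MLLM-Bench entries) evaluate with instance-specific or per-sample criteria. The sketch below shows one generic way such a judge loop could be wired up; `call_llm`, the prompt template, and the 1 to 5 scale are illustrative assumptions, not details taken from any of the cited papers.

```python
# Hypothetical per-sample-criteria judging loop. `call_llm` is a stand-in for
# whatever completion API is available; it is not a real library function.
from typing import List


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your own model client here.")


JUDGE_TEMPLATE = """You are grading a model response against sample-specific criteria.

Question:
{question}

Response:
{response}

Criteria for this sample:
{criteria}

Return a single integer from 1 (fails all criteria) to 5 (meets all criteria)."""


def judge(question: str, response: str, criteria: List[str]) -> int:
    # Build a prompt that carries this sample's own criteria, then ask the judge model.
    prompt = JUDGE_TEMPLATE.format(
        question=question,
        response=response,
        criteria="\n".join(f"- {c}" for c in criteria),
    )
    raw = call_llm(prompt)
    digits = [ch for ch in raw if ch.isdigit()]
    return int(digits[0]) if digits else 0  # crude parsing; real harnesses validate harder
```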