Related papers: From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility

URL: http://arxiv.org/abs/2602.20513v1
Date: Tue, 24 Feb 2026 03:31:07 GMT
Title: From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility
Authors: Gavin Levinson, Keith Feldman
Abstract summary: We introduce a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics. An external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics.
Score: 0.2209921757303168
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model's fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.
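
The hierarchy described in the abstract (domain, then dimension, then component, then metrics) can be pictured as a small nested data structure. The following is only an illustrative sketch, not the authors' implementation or the schema of their web tool: the four domain names come from the abstract, while the class names, the `metrics_for` helper, and every dimension, component, and metric name are hypothetical placeholders.

```python
from dataclasses import dataclass, field

# The four domain names are taken from the abstract.
# Everything nested below them is a hypothetical placeholder.
DOMAINS = ("performance", "interaction", "operations", "governance")


@dataclass
class Component:
    name: str
    metrics: list[str] = field(default_factory=list)


@dataclass
class Dimension:
    name: str
    components: list[Component] = field(default_factory=list)


@dataclass
class Domain:
    name: str
    dimensions: list[Dimension] = field(default_factory=list)


def metrics_for(taxonomy: list[Domain], domain_name: str) -> list[str]:
    """Collect every metric under one domain by walking dimension -> component."""
    return [
        metric
        for domain in taxonomy
        if domain.name == domain_name
        for dimension in domain.dimensions
        for component in dimension.components
        for metric in component.metrics
    ]


# Build an empty taxonomy, then attach one placeholder branch to "operations".
taxonomy = [Domain(name) for name in DOMAINS]
taxonomy[2].dimensions.append(
    Dimension(
        "example-dimension",
        [Component("example-component", ["example-metric-a", "example-metric-b"])],
    )
)

print(metrics_for(taxonomy, "operations"))
```

Keeping metrics at the leaves mirrors the abstract's statement that each component is grounded in metrics; a real catalogue would replace the placeholder names with the entries from the authors' repository.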

Related papers

Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation [11.450834626205676]
Table-BiEval is a novel approach based on a human-free, self-supervised evaluation framework. It calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency.
arXiv Detail & Related papers (2026-01-09T07:38:27Z)
Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models [0.8193467416247519]
We introduce a purpose-oriented evaluation framework that employs nine Large Language Model (LLM)-based metrics spanning four key dimensions of topic quality. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts.
arXiv Detail & Related papers (2025-09-08T18:46:08Z)
Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains. Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored. We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z)
MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [33.250579401886206]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate Large Language Model (LLM) performance within the Model Context Protocol (MCP) framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
SCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components, among them: TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; and a PC$^2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy compared to the classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications [3.686808512438363]
Large Language Models (LLMs) have numerous use-cases, and have already acquired a significant degree of enterprise adoption. This paper provides the basis for a more comprehensive evaluation framework, based upon a traditional game and tool-based architecture.
arXiv Detail & Related papers (2025-03-05T06:44:38Z)
Dynamic benchmarking framework for LLM-based conversational data capture [0.0]
This paper introduces a benchmarking framework to assess large language models (LLMs). It integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses.
arXiv Detail & Related papers (2025-02-04T15:47:47Z)
Understanding and Optimizing Agentic Workflows via Shapley value [49.508008396810624]
We introduce ShapleyFlow, the first framework that employs cooperative game theory to analyze and optimize agentic configurations. ShapleyFlow enables fine-grained attribution of each component's contribution and facilitates the identification of task-specific optimal configurations.
arXiv Detail & Related papers (2025-02-01T18:07:34Z)
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models [95.96734086126469]
Large language models (LLMs) can serve as assistants that help users accomplish their jobs, and can also support the development of advanced applications. For LLMs to be widely applied, inference efficiency is an essential concern, and it has been widely studied in existing work. We perform a detailed coarse-to-fine analysis of the inference performance of various code libraries.
arXiv Detail & Related papers (2024-04-17T15:57:50Z)
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.