Evaluating LLM Metrics Through Real-World Capabilities
- URL: http://arxiv.org/abs/2505.08253v1
- Date: Tue, 13 May 2025 06:02:37 GMT
- Title: Evaluating LLM Metrics Through Real-World Capabilities
- Authors: Justin K Miller, Wenjia Tang,
- Abstract summary: We analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs)<n>We then assess the extent to which existing benchmarks cover these capabilities, revealing significant gaps in coverage, efficiency measurement, and interpretability.<n>For four of the six capabilities, we identify the benchmarks that best align with real-world tasks and use them to compare leading models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As generative AI becomes increasingly embedded in everyday workflows, it is important to evaluate its performance in ways that reflect real-world usage rather than abstract notions of intelligence. Unlike many existing benchmarks that assess general intelligence, our approach focuses on real-world utility, evaluating how well models support users in everyday tasks. While current benchmarks emphasize code generation or factual recall, users rely on AI for a much broader range of activities-from writing assistance and summarization to citation formatting and stylistic feedback. In this paper, we analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs): Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval. We then assess the extent to which existing benchmarks cover these capabilities, revealing significant gaps in coverage, efficiency measurement, and interpretability. Drawing on this analysis, we use human-centered criteria to identify gaps in how well current benchmarks reflect common usage that is grounded in five practical criteria: coherence, accuracy, clarity, relevance, and efficiency. For four of the six capabilities, we identify the benchmarks that best align with real-world tasks and use them to compare leading models. We find that Google Gemini outperforms other models-including OpenAI's GPT, xAI's Grok, Meta's LLaMA, Anthropic's Claude, DeepSeek, and Qwen from Alibaba-on these utility-focused metrics.
Related papers
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [48.24550684610705]
ArtifactsBench is a framework for automated visual code generation evaluation.<n>Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.<n>We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks [77.19223035769248]
Recent breakthroughs in large multimodal models (LMMs) have demonstrated remarkable proficiency in following general-purpose instructions for image generation.<n>We introduce OmniGenBench, a novel benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs.<n>Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand.
arXiv Detail & Related papers (2025-05-24T16:29:34Z) - Human Re-ID Meets LVLMs: What can we expect? [14.370360290704197]
We compare the performance of the leading large vision-language models in the human re-identification task.<n>Our results confirm the strengths of LVLMs, but also their severe limitations that often lead to catastrophic answers.
arXiv Detail & Related papers (2025-01-30T19:00:40Z) - Improving LLM Leaderboards with Psychometrical Methodology [0.0]
The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance.<n>These benchmarks resemble human tests and surveys, as they consist of questions designed to measure emergent properties in the cognitive behavior of these systems.<n>However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined.
arXiv Detail & Related papers (2025-01-27T21:21:46Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.<n>LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.<n>We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)<n>MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.<n>It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.<n>A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROS, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Mapping global dynamics of benchmark creation and saturation in
artificial intelligence [5.233652342195164]
We create maps of the global dynamics of benchmark creation and saturation.
We curated data for 1688 benchmarks covering the entire domains of computer vision and natural language processing.
arXiv Detail & Related papers (2022-03-09T09:16:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.