SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
- URL: http://arxiv.org/abs/2602.10017v1
- Date: Tue, 10 Feb 2026 17:39:17 GMT
- Title: SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
- Authors: Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick
- Abstract summary: Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings. We propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types.
- Score: 6.760582976667912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
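The abstract names the four dimensions but, as a summary, does not give their formulas. As a rough illustration only, a reference-free scorer along these lines might look like the sketch below; the concrete metric choices, the `all-MiniLM-L6-v2` embedding model, and the `answer_fn` hook are all assumptions for the sake of the example, not details from the paper.

```python
# A minimal sketch (not the paper's implementation) of a reference-free,
# four-dimension scorer in the spirit of SCORE. Every metric definition
# below (entity/number density for specificity, cosine similarity for
# relevance, paraphrase agreement for robustness, token overlap for
# context utilization) is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance(question: str, answer: str) -> float:
    """Semantic similarity between the question and the answer."""
    q_emb, a_emb = model.encode([question, answer])
    return float(util.cos_sim(q_emb, a_emb))

def context_utilization(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    ans = set(answer.lower().split())
    ctx = set(context.lower().split())
    return len(ans & ctx) / max(len(ans), 1)

def specificity(answer: str) -> float:
    """Crude proxy: share of tokens carrying fine-grained detail
    (numbers or capitalized terms such as named entities)."""
    tokens = answer.split()
    detailed = [t for t in tokens if any(c.isdigit() for c in t) or t[:1].isupper()]
    return len(detailed) / max(len(tokens), 1)

def robustness(question: str, paraphrases: list[str], answer_fn) -> float:
    """Mean similarity between the answer to the original question and the
    answers to its paraphrases; answer_fn wraps the LLM under evaluation."""
    base = model.encode(answer_fn(question))
    sims = [float(util.cos_sim(base, model.encode(answer_fn(p))))
            for p in paraphrases]
    return sum(sims) / max(len(sims), 1)

def score(question, answer, context, paraphrases, answer_fn) -> dict:
    """Report all four dimensions side by side rather than as one number."""
    return {
        "specificity": specificity(answer),
        "context_utilization": context_utilization(answer, context),
        "robustness": robustness(question, paraphrases, answer_fn),
        "relevance": relevance(question, answer),
    }
```

Keeping the four numbers separate, rather than averaging them, mirrors the abstract's conclusion that no single metric captures answer quality in isolation.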
Related papers
- Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z)
- Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy [28.293009223912602]
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's taxonomy.
arXiv Detail & Related papers (2026-01-28T05:01:11Z)
- Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation [11.709100855086291]
We propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions. Our framework integrates four complementary modules to holistically evaluate student answers.
arXiv Detail & Related papers (2025-10-09T08:05:39Z)
- Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework [2.102846336724103]
Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. This work introduces a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation.
arXiv Detail & Related papers (2025-08-26T11:16:14Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
- Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey [54.90240495777929]
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP). With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. This paper explores the definition, forms, and implications of ambiguity for language-driven systems.
arXiv Detail & Related papers (2025-05-18T20:53:41Z)
- A Task-Centric Perspective on Recommendation Systems [32.44458308850838]
We analyze RecSys task formulations, emphasizing key components such as input-output structures, temporal dynamics, and candidate item selection. We explore the balance between task specificity and model generalizability, highlighting how well-defined task formulations serve as the foundation for robust evaluation and effective solution development.
arXiv Detail & Related papers (2025-03-27T06:10:22Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z)
- HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preferences.
HD-Eval inherits the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.