Objective Metrics for Evaluating Large Language Models Using External Data Sources
- URL: http://arxiv.org/abs/2508.08277v1
- Date: Fri, 01 Aug 2025 02:24:19 GMT
- Title: Objective Metrics for Evaluating Large Language Models Using External Data Sources
- Authors: Haoze Du, Richard Li, Edward Gehringer,
- Abstract summary: This paper proposes a framework for leveraging subjective metrics derived from the class textual materials across different semesters.<n>The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation.<n>This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
- Score: 4.574672973076743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging subjective metrics derived from the class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
Related papers
- Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
Divide-and-conquer approach breaks comprehensive evaluation task into localized scoring tasks, followed by a final global assessment.<n>We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations.<n>Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
arXiv Detail & Related papers (2025-05-26T16:39:41Z) - Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games.<n>This study investigates the reliability of five small-scale LLMs when assessing player responses in textitEn-join, a game that simulates decision-making within energy communities.<n>Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
arXiv Detail & Related papers (2025-04-13T10:46:13Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.<n> Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.<n>We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability.<n>Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum.<n>We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models [0.0]
This study explores the effectiveness of various in-context learning strategies in language models (LMs) across benchmark datasets.
We employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore.
Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced.
arXiv Detail & Related papers (2024-10-15T09:19:42Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.