Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency
- URL: http://arxiv.org/abs/2511.07698v1
- Date: Wed, 12 Nov 2025 01:12:07 GMT
- Title: Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency
- Authors: Mohammadjavad Mehditabar, Saurabhsingh Rajput, Antonio Mastropaolo, Tushar Sharma
- Abstract summary: We present a framework, BRACE, to benchmark Code Language Models on a unified scale of energy efficiency and functional correctness. We propose two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). Our analysis reveals that models generally perform better on code summarization tasks because they are not required to produce grammar-constrained, syntactically correct output.
- Score: 5.771786260272727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present a framework, BRACE, to benchmark CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and summarization tasks, proposing two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic Euclidean-based rankings with static trade-offs that are robust to outliers, while OTER offers trend-aware evaluation with dynamic trade-offs that capture the complex correlation between energy and accuracy, each offering a distinct perspective on the problem. These rating methods enable us to rate LLMs on a 1-5 scale reflecting their combined capabilities in terms of energy efficiency and functional correctness. Our analysis reveals that models generally perform better on code summarization tasks because they are not required to produce grammar-constrained, syntactically correct output. We also find that model size does not have a significant impact on ratings, indicating that models which use their parameters efficiently can rank highly on these scales regardless of size. The proposed BRACE framework empowers practitioners to make evidence-based model selections that balance sustainability with task requirements, guiding the choice of rating method (CIRC for deterministic comparisons, OTER for trend-aware evaluation) based on deployment priorities.
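The abstract names the two rating methods but does not give their formulas, so the Python sketch below is only an illustrative reading of the two ideas, not the paper's implementation: CIRC is rendered as the Euclidean distance from the ideal low-energy, high-accuracy corner bucketed into five concentric rings, and OTER as the deviation of a model's observed energy from a fitted energy-versus-accuracy trend. The min-max normalization, equal-width rings, and linear trend are all assumptions.

```python
import numpy as np

def circ_rating(energy_j, accuracy, n_rings=5):
    """Illustrative CIRC-style rating (assumed formulation, not the paper's).

    Models are placed in a 2-D space of normalized energy cost and
    normalized error (1 - accuracy); the Euclidean distance from the
    ideal point (0, 0) is bucketed into concentric rings mapped to 5..1.
    """
    energy_j = np.asarray(energy_j, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)

    # Min-max normalization (an assumption; the paper may scale differently).
    e_norm = (energy_j - energy_j.min()) / (np.ptp(energy_j) or 1.0)
    err_norm = 1.0 - (accuracy - accuracy.min()) / (np.ptp(accuracy) or 1.0)

    # Distance from the ideal corner (lowest energy, highest accuracy).
    dist = np.hypot(e_norm, err_norm)

    # Equal-width concentric rings over the observed distance range.
    edges = np.linspace(0.0, dist.max(), n_rings + 1)
    ring = np.clip(np.digitize(dist, edges[1:-1]), 0, n_rings - 1)
    return n_rings - ring  # innermost ring -> rating 5


def oter_rating(energy_j, accuracy, n_levels=5):
    """Illustrative OTER-style rating (assumed formulation, not the paper's).

    Fit a linear energy-versus-accuracy trend over all models (the
    "expectation") and rate each model by how far its observed energy
    falls below that trend: cheaper than expected -> higher rating.
    """
    energy_j = np.asarray(energy_j, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)

    slope, intercept = np.polyfit(accuracy, energy_j, deg=1)
    expected = slope * accuracy + intercept
    margin = expected - energy_j  # positive = better than the trend

    # Equal-width levels over the observed margins -> ratings 1..5.
    edges = np.linspace(margin.min(), margin.max(), n_levels + 1)
    level = np.clip(np.digitize(margin, edges[1:-1]), 0, n_levels - 1)
    return level + 1
```

Applied to per-model energy and accuracy measurements from a benchmark run, both functions return integer ratings on the 1-5 scale described in the abstract, with 5 marking the best combined energy-accuracy standing under these assumed bucketing rules.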
Related papers
- CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria [48.70940362676624]
We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality data points curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
arXiv Detail & Related papers (2026-01-28T07:46:13Z)
- Metrics and evaluations for computational and sustainable AI efficiency [26.52588349722099]
Current approaches fail to provide a holistic view, making it difficult to compare and optimise systems. We propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics. Our framework provides pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions.
arXiv Detail & Related papers (2025-10-18T03:30:15Z)
- A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs [0.0]
This paper presents a framework for benchmarking Large Language Models (LLMs) on the task of classifying complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores.
arXiv Detail & Related papers (2025-09-08T15:48:17Z)
- RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
- HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization [31.908590128913094]
HeuriGym is an agentic framework designed for evaluating algorithms generated by Large Language Models (LLMs). We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
arXiv Detail & Related papers (2025-06-09T17:46:47Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems and harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)