DeCEAT: Decoding Carbon Emissions for AI-driven Software Testing
- URL: http://arxiv.org/abs/2602.18012v1
- Date: Fri, 20 Feb 2026 05:54:58 GMT
- Title: DeCEAT: Decoding Carbon Emissions for AI-driven Software Testing
- Authors: Pragati Kumari, Novarun Deb
- Abstract summary: This work introduces the DeCEAT framework, which systematically evaluates the environmental and performance trade-offs of small language models (SLMs). Our results show that different SLMs exhibit distinct sustainability strengths: some prioritize lower energy use and faster execution, while others maintain higher stability or accuracy under carbon constraints. This work provides a focused sustainability evaluation framework specifically tailored to automated SLM-based test generation.
- Score: 0.42970700836450487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing use of language models in automated software testing raises concerns about their environmental impact, yet existing sustainability analyses focus almost exclusively on large language models. As a result, the energy and carbon characteristics of small language models (SLMs) during test generation remain largely unexplored. To address this gap, this work introduces the DeCEAT framework, which systematically evaluates the environmental and performance trade-offs of SLMs using the HumanEval benchmark and adaptive prompt variants (based on the Anthropic template). The framework quantifies emission and time-aware behavior under controlled conditions, with CodeCarbon measuring energy consumption and carbon emissions, and unit test coverage assessing the quality of generated tests. Our results show that different SLMs exhibit distinct sustainability strengths: some prioritize lower energy use and faster execution, while others maintain higher stability or accuracy under carbon constraints. These findings demonstrate that sustainability in the generation of SLM-driven tests is multidimensional and strongly shaped by prompt design. This work provides a focused sustainability evaluation framework specifically tailored to automated SLM-based test generation, clarifying how prompt structure and model choice jointly influence environmental and performance outcomes.
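The measurement setup the abstract describes can be reproduced with CodeCarbon directly. The sketch below is a minimal illustration, not the authors' code: it wraps a hypothetical SLM test-generation call (`generate_tests` is a placeholder) in CodeCarbon's EmissionsTracker and records emissions and wall-clock time for one prompt variant.

```python
import time

from codecarbon import EmissionsTracker

def generate_tests(model, prompt: str) -> str:
    """Placeholder for an SLM call that returns unit-test code."""
    raise NotImplementedError

def measure_variant(model, prompts, variant_name):
    # Track energy use and CO2-equivalent emissions for one prompt variant.
    tracker = EmissionsTracker(project_name=f"deceat-{variant_name}", log_level="error")
    tracker.start()
    start = time.perf_counter()
    outputs = [generate_tests(model, p) for p in prompts]
    elapsed = time.perf_counter() - start
    kg_co2eq = tracker.stop()  # CodeCarbon reports kg CO2-equivalent
    return {"variant": variant_name, "seconds": elapsed,
            "kg_co2eq": kg_co2eq, "outputs": outputs}
```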
Related papers
- AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models [2.7946918847372277]
We propose AI-CARE, an evaluation tool for reporting the energy consumption and carbon emissions of machine learning models. We demonstrate, through theoretical analysis and empirical validation, that carbon-aware benchmarking changes the relative ranking of models. Our proposal aims to shift the research community toward transparent, multi-objective evaluation and align ML progress with global sustainability goals.
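AI-CARE's actual metric is defined in the paper; the sketch below only illustrates the core claim with a hypothetical carbon-aware score, showing how penalizing emissions can flip a ranking produced by accuracy alone.

```python
def carbon_aware_score(accuracy: float, kg_co2eq: float, alpha: float = 0.5) -> float:
    # Hypothetical score, not AI-CARE's formula: accuracy discounted by
    # emissions, with alpha weighting the emission penalty.
    return accuracy - alpha * kg_co2eq

models = [
    {"name": "model-a", "accuracy": 0.91, "kg_co2eq": 0.40},
    {"name": "model-b", "accuracy": 0.88, "kg_co2eq": 0.05},
]
# By accuracy alone, model-a ranks first; the carbon-aware score
# (0.71 vs 0.855) reverses the order.
ranked = sorted(models, reverse=True,
                key=lambda m: carbon_aware_score(m["accuracy"], m["kg_co2eq"]))
print([m["name"] for m in ranked])  # ['model-b', 'model-a']
```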
arXiv Detail & Related papers (2026-02-17T21:52:48Z)
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios, then categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
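A minimal sketch of the noise-injection idea, assuming the two categories named in the abstract; both injection functions are hypothetical stand-ins, not AgentNoiseBench's pipeline.

```python
import random

def inject_user_noise(message: str, rng: random.Random) -> str:
    # Hypothetical user-noise: simulate a typo by swapping two adjacent characters.
    if len(message) < 2:
        return message
    i = rng.randrange(len(message) - 1)
    chars = list(message)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def inject_tool_noise(tool_result: dict, rng: random.Random, drop_prob: float = 0.2) -> dict:
    # Hypothetical tool-noise: randomly drop fields from a tool's JSON result.
    return {k: v for k, v in tool_result.items() if rng.random() > drop_prob}

rng = random.Random(0)
print(inject_user_noise("list open incidents", rng))
print(inject_tool_noise({"status": "ok", "count": 3, "items": []}, rng))
```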
arXiv Detail & Related papers (2026-02-11T20:33:10Z)
- Emissions and Performance Trade-off Between Small and Large Language Models [1.0863226323853896]
This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Our results show that in four out of the six selected tasks, SLMs maintained comparable performance while significantly reducing carbon emissions during inference.
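The trade-off can be summarized with two ratios, as in the illustrative sketch below; the numbers are made up, not the paper's results.

```python
def tradeoff(slm_acc: float, llm_acc: float, slm_kg: float, llm_kg: float) -> dict:
    # Performance retained by the SLM vs emissions saved relative to the LLM.
    return {
        "performance_retained": slm_acc / llm_acc,
        "emissions_saved": 1.0 - slm_kg / llm_kg,
    }

# Made-up numbers: ~97% of the LLM's accuracy for a 96% emission reduction.
print(tradeoff(slm_acc=0.85, llm_acc=0.88, slm_kg=0.02, llm_kg=0.50))
```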
arXiv Detail & Related papers (2025-12-21T07:00:22Z)
- Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon & Energy estimation for LLMs [8.377809633825196]
We discuss the challenges of current approaches and present our evolving framework, R-ICE, which estimates prompt-level inference carbon emissions. Our promising validation results suggest that benchmark-based modelling holds great potential for inference emission estimation.
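R-ICE's own estimation model is not reproduced here; the sketch below shows the general benchmark-based idea with a toy linear energy model whose `joules_per_token` factor would be calibrated from benchmark runs.

```python
def estimate_prompt_emissions(prompt_tokens: int, output_tokens: int,
                              joules_per_token: float = 0.3,
                              grid_kg_co2_per_kwh: float = 0.4) -> float:
    # Toy linear model, not R-ICE: energy assumed proportional to tokens
    # processed; joules_per_token would come from benchmark calibration.
    energy_kwh = (prompt_tokens + output_tokens) * joules_per_token / 3.6e6  # J -> kWh
    return energy_kwh * grid_kg_co2_per_kwh  # kg CO2eq

print(estimate_prompt_emissions(prompt_tokens=800, output_tokens=200))
```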
arXiv Detail & Related papers (2025-06-10T12:23:02Z)
- A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis examines design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
- Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View [2.5832043241251337]
FUEL is a framework for evaluating the environmental impact of large language models (LLMs). We uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice.
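A minimal sketch of the functional-unit idea, assuming emissions per 1,000 generated tokens as the unit; FUEL's actual functional units and configurations may differ, and the configuration names below are hypothetical.

```python
def kg_co2eq_per_1k_tokens(total_kg_co2eq: float, requests: int,
                           tokens_per_request: float) -> float:
    # Normalize serving emissions to a functional unit: kg CO2eq per
    # 1,000 tokens served. The unit choice is illustrative.
    total_tokens = requests * tokens_per_request
    return total_kg_co2eq / (total_tokens / 1000.0)

# Compare hypothetical serving configurations on the same functional unit:
configs = {
    "fp16-A100": kg_co2eq_per_1k_tokens(1.2, requests=10_000, tokens_per_request=512),
    "int4-L4": kg_co2eq_per_1k_tokens(0.4, requests=10_000, tokens_per_request=512),
}
print(configs)
```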
arXiv Detail & Related papers (2025-02-16T20:20:18Z)
- CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs [0.0]
This paper analyzes the performance of Small Language Models (SLMs) and Vision Language Models (VLMs). To quantify the trade-off between model performance and carbon emissions, we introduce a novel metric called CEGI (Carbon Efficient Gain Index). Our findings suggest that the marginal gains in accuracy from larger models do not justify the substantial increase in carbon emissions.
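The abstract does not give CEGI's formula; the sketch below shows one plausible form (accuracy gain per additional kg CO2eq), which is an assumption, not necessarily the paper's exact definition.

```python
def cegi(acc_large: float, acc_small: float,
         kg_large: float, kg_small: float) -> float:
    # One plausible reading of a Carbon Efficient Gain Index: accuracy
    # gained per extra kg CO2eq. Not necessarily the paper's definition.
    return (acc_large - acc_small) / (kg_large - kg_small)

# A 3-point accuracy gain bought with ~20x the emissions yields a low index:
print(cegi(acc_large=0.90, acc_small=0.87, kg_large=2.0, kg_small=0.1))  # ~0.016
```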
arXiv Detail & Related papers (2024-12-03T17:32:47Z)
- R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models [50.19174067263255]
We introduce prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous-action, goal-based robotic control POMDP environments.
We show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
arXiv Detail & Related papers (2024-09-21T18:32:44Z)
- Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain [0.0]
This paper examines the performance of two Large Language Models (LLMs), GPT-3.5 and Llama 2, and one Small Language Model (SLM), Gemma, across three classification tasks in the climate change (CC) and environmental domain.
arXiv Detail & Related papers (2024-08-30T15:52:41Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation [82.85015548989223]
Pentathlon is a benchmark for holistic and realistic evaluation of model efficiency.
Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle.
It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption.
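Latency, throughput, and memory overhead can be measured with the standard library alone, as in the sketch below; energy needs an external meter or a tool such as CodeCarbon. `predict` is any inference callable, not Pentathlon's API.

```python
import time
import tracemalloc

def profile_inference(predict, batch):
    # Measure latency, throughput, and peak Python heap for one call.
    tracemalloc.start()
    start = time.perf_counter()
    outputs = predict(batch)
    latency = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"latency_s": latency,
            "throughput_items_per_s": len(batch) / latency,
            "peak_mem_mb": peak_bytes / 1e6,
            "outputs": outputs}

print(profile_inference(lambda xs: [x * 2 for x in xs], list(range(1000))))
```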
arXiv Detail & Related papers (2023-07-19T01:05:33Z)
- A Comparative Study of Machine Learning Algorithms for Anomaly Detection in Industrial Environments: Performance and Environmental Impact [62.997667081978825]
This study seeks to reconcile the demand for high-performance machine learning models with environmental sustainability.
Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance.
However, superior outcomes were obtained with optimised configurations, albeit with a commensurate increase in resource consumption.
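A minimal sketch of pairing a Random Forest with emission tracking, assuming scikit-learn and CodeCarbon; the dataset and hyperparameters are illustrative, not the study's.

```python
from codecarbon import EmissionsTracker
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative imbalanced dataset standing in for industrial anomaly data.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
tracker = EmissionsTracker(project_name="rf-anomaly", log_level="error")
tracker.start()
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
kg_co2eq = tracker.stop()
print(f"train accuracy={clf.score(X, y):.3f}, emissions={kg_co2eq:.6f} kg CO2eq")
```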
arXiv Detail & Related papers (2023-07-01T15:18:00Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)