RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
- URL: http://arxiv.org/abs/2503.22851v2
- Date: Thu, 03 Apr 2025 00:55:35 GMT
- Title: RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
- Authors: Feng Lin, Dong Jae Kim, Zhenhao Li, Jinqiu Yang, Tse-Hsun Chen
- Abstract summary: We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when considering NFRs in code generation.
- Score: 52.87427601131587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation across four NFR dimensions: design, readability, reliability, and performance, using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when considering NFRs in code generation. Specifically, under prompt variation, including NFRs leads to a decrease in Pass@1 by up to 39 percent and an increase in the standard deviation from 0.48 to 2.48 compared to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also results in higher prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvements in one aspect (e.g., reduced code smells) often accompanied by regressions in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When varying workflows, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) enhancing Function-Only-generated code with the same NFR.
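The prompt-variation methodology described above (Pass@1 mean and standard deviation across paraphrased prompts) can be sketched as follows. This is a minimal illustration of the metric computation only; the pass/fail outcomes are hypothetical, not the paper's data.

```python
import statistics

def pass_at_1(results):
    """Percentage of problems whose first generated solution passes all tests."""
    return 100.0 * sum(results) / len(results)

# Hypothetical pass/fail outcomes (1 = first sample passed) for the same
# problem set under three paraphrases of an NFR-augmented prompt.
outcomes_per_prompt = [
    [1, 1, 0, 1, 1, 0, 1, 1],  # paraphrase A
    [1, 0, 0, 1, 1, 0, 1, 0],  # paraphrase B
    [1, 1, 0, 1, 0, 0, 1, 1],  # paraphrase C
]

scores = [pass_at_1(r) for r in outcomes_per_prompt]
mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # prompt sensitivity, per the abstract
print(f"Pass@1 per paraphrase: {scores}")
print(f"mean={mean:.2f}, stdev={spread:.2f}")
```

A robust model would show a small `spread`; the abstract reports standard deviations rising from 0.48 to 2.48 once NFRs are added.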
Related papers
- Improving Code LLM Robustness to Prompt Perturbations via Layer-Aware Model Editing [13.099973383252452]
Large language models (LLMs) are highly sensitive to prompt perturbations. We introduce CREME, a novel approach that enhances LLM robustness through targeted parameter updates. Experimental results show that CREME improves Pass@1 accuracy by 63% on perturbed prompts.
arXiv Detail & Related papers (2025-07-22T09:57:55Z)
- Enhancing the Robustness of LLM-Generated Code: Empirical Study and Framework [25.793118619876513]
RobGen is a framework designed to enhance code robustness without requiring model retraining. RobGen reduces the proportion of less robust model-generated code by 20.0%.
arXiv Detail & Related papers (2025-03-26T03:44:03Z)
- The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs).
In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR).
RaZeR remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions.
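The core idea, reclaiming the redundant negative-zero code of a sign-magnitude FP format, can be sketched with a toy decoder. The 4-bit format, magnitude grid, and special value below are illustrative assumptions, not the paper's actual design choices.

```python
def decode_fp4(code, special_value=6.0):
    """Decode a toy 4-bit sign/magnitude FP code, remapping negative zero.

    Codes 0..7 are positive magnitudes, 8..15 their negatives; code 8 would
    decode to -0.0, a redundant encoding of zero, so it is reassigned to
    `special_value` instead (the RaZeR idea, with illustrative numbers).
    """
    magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0]  # hypothetical grid
    if code == 8:                # the redundant -0.0 slot
        return special_value     # reclaimed to extend the representable set
    sign = -1.0 if code >= 8 else 1.0
    return sign * magnitudes[code % 8]

values = [decode_fp4(c) for c in range(16)]
print(values)  # 16 distinct values instead of 15: the -0.0 slot now carries 6.0
```

Reclaiming the wasted encoding lets the quantization grid cover one extra value, which can be chosen to better fit the weight distribution.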
arXiv Detail & Related papers (2025-01-06T22:40:40Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Applying RLAIF for Code Generation with API-usage in Lightweight LLMs [15.366324461797582]
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains.
This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs.
arXiv Detail & Related papers (2024-06-28T17:16:03Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study [2.7731115923558143]
Large Language Models (LLMs) have shown remarkable performance in NLP tasks, but their efficacy in generating high-quality Counterfactuals (CFs) remains uncertain.
We compare several common LLMs and evaluate their CFs, assessing both intrinsic metrics, and the impact of these CFs on data augmentation.
Our results show that LLMs generate fluent CFs, but struggle to keep the induced changes minimal.
arXiv Detail & Related papers (2024-04-26T11:57:21Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation [31.657608562937543]
We introduce GRIFFIN, a training-free and calibration-free method that selects unique FF experts at the sequence level for efficient generation.
GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks.
arXiv Detail & Related papers (2024-04-01T17:56:06Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Randomized Smoothing with Masked Inference for Adversarially Robust Text Classifications [3.6048665052465667]
We introduce RSMI, a novel two-stage framework that combines randomized smoothing (RS) with masked inference (MI) to improve the adversarial robustness of NLP systems.
RS transforms a classifier into a smoothed classifier to obtain robust representations, whereas MI forces a model to exploit the surrounding context of a masked token in an input sequence.
RSMI improves adversarial robustness by 2 to 3 times over existing state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-05-11T01:50:16Z)
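The two-stage RSMI pipeline above (randomized smoothing over masked copies of the input, then a majority vote) can be sketched in miniature. The classifier and masking scheme here are toy placeholders, not the paper's actual models or parameters.

```python
import random
from collections import Counter

def toy_classifier(tokens):
    """Stand-in sentiment classifier: positive iff more 'good' than 'bad' tokens."""
    return int(tokens.count("good") > tokens.count("bad"))

def mask_tokens(tokens, p, rng):
    """Masked inference: randomly hide tokens so the model must use context."""
    return [t if rng.random() > p else "[MASK]" for t in tokens]

def smoothed_predict(tokens, n_samples=25, p=0.3, seed=0):
    """Randomized smoothing: majority vote over many noisy (masked) copies."""
    rng = random.Random(seed)
    votes = Counter(
        toy_classifier(mask_tokens(tokens, p, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]

label = smoothed_predict("this movie is good good good good bad".split())
print(label)
```

The vote over noisy copies makes the prediction stable under small adversarial edits to the input, which is the source of the robustness gain the abstract reports.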
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.