Bias Testing and Mitigation in LLM-based Code Generation
- URL: http://arxiv.org/abs/2309.14345v3
- Date: Fri, 24 May 2024 13:03:49 GMT
- Title: Bias Testing and Mitigation in LLM-based Code Generation
- Authors: Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, Heming Cui
- Abstract summary: This paper presents a novel bias testing framework specifically designed for code generation tasks.
Our findings reveal that 20.29% to 44.93% of the code functions generated by the models under study are biased when handling bias-sensitive tasks.
To mitigate bias in code generation models, we evaluate five bias mitigation prompting strategies.
- Score: 23.787124657688267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Utilizing state-of-the-art Large Language Models (LLMs), automatic code generation models play a pivotal role in enhancing the productivity of software development procedures. As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as bias related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, yet it remains under-explored in the literature. This paper presents a novel bias testing framework specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of the bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% of the code functions generated by the models under study are biased when handling bias-sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behaviors. To mitigate bias in code generation models, we evaluate five bias mitigation prompting strategies: using bias testing results to refine the code under zero-shot, one-shot, and few-shot prompting, as well as two Chain-of-Thought (CoT) prompts. Our evaluation shows that all of these strategies are effective in mitigating bias, with one-shot and few-shot learning being the two most effective. For GPT-4, 80% to 90% of code bias can be removed with one-shot learning.
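To make the testing idea concrete, below is a minimal sketch of the kind of differential check such a framework can apply: a model-generated function is probed with inputs that differ only in a sensitive attribute, and divergent outputs flag the function as biased. The function name, attribute values, and the one-shot repair prompt are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch (hypothetical names): flag a generated function as biased
# if varying only a sensitive attribute changes its output.

def assess_loan_eligibility(age: int, gender: str, income: float) -> bool:
    """Stand-in for a model-generated, bias-sensitive code function."""
    # A biased generation might gate the decision on gender:
    return income > 30_000 and gender == "male"

def is_biased(func, base_args: dict, sensitive_values: dict) -> bool:
    """Return True if varying any sensitive attribute alone, with all
    other inputs held fixed, changes the function's output."""
    for attr, values in sensitive_values.items():
        outputs = {func(**{**base_args, attr: v}) for v in values}
        if len(outputs) > 1:  # output depends on the sensitive attribute
            return True
    return False

base = {"age": 40, "gender": "male", "income": 50_000.0}
sensitive = {"gender": ["male", "female"], "age": [25, 40, 65]}
print(is_biased(assess_loan_eligibility, base, sensitive))  # True

# A one-shot mitigation strategy in this setting would pair the failing
# test with a single worked repair example in the prompt (illustrative
# wording, not the paper's prompt):
ONE_SHOT_PROMPT = """The function below treats applicants differently by
gender. Example fix: replace `gender == "male"` with a criterion based only
on income and credit history. Now rewrite the function so that its output
does not depend on gender:
"""
```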
Related papers
- Comparing Human and LLM Generated Code: The Jury is Still Out! [8.456554883523472]
We compare the effectiveness of large language models (LLMs) and human programmers in producing Python software code.
We use various static analysis tools, including Pylint, Radon, and Bandit, as well as test cases.
We observe security flaws in code generated by both humans and GPT-4, but GPT-4 code included more severe outliers.
arXiv Detail & Related papers (2025-01-28T11:11:36Z) - FairCoder: Evaluating Social Bias of LLMs in Code Generation [25.358230310973248]
We introduce FairCoder, a novel benchmark for evaluating social bias in code generation.
Three metrics are designed to assess fairness performance on this benchmark.
The findings reveal that all tested LLMs exhibit social bias.
arXiv Detail & Related papers (2025-01-09T17:42:23Z) - Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar [15.421030528350212]
We build a code-obfuscation-based benchmark, OBFUSEVAL, to evaluate large language models.
We use a three-level strategy to obfuscate descriptions, code, and context dependencies.
The results show that, after obfuscation, the average decrease in test pass rate can reach up to 62.5%.
arXiv Detail & Related papers (2024-12-11T05:31:39Z) - Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written [11.16693333878553]
This paper introduces an empirical study to evaluate the adversarial robustness of Pre-trained Models of Code (PTMCs) fine-tuned on code written by humans.
We consider two datasets, two state-of-the-art PTMCs, two robustness evaluation criteria, and three metrics to use in our experiments.
arXiv Detail & Related papers (2024-11-15T20:25:32Z) - A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation [0.0]
27 recent papers have been reviewed and split into two groups.
The first group consists of new methods for bug detection and repair, which include locating semantic errors.
The second group dwells on code generation, providing an overview of both general-purpose LLMs fine-tuned for programming and task-specific models.
It also presents methods to improve code generation, such as identifier-aware training, fine-tuning at the instruction level, and incorporating semantic code structures.
arXiv Detail & Related papers (2024-11-12T06:47:54Z) - A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why? [9.246899995643918]
Large Language Models can still generate defective code that deviates from the specification.
Seven categories of non-syntactic mistakes were identified through extensive manual analyses.
Our evaluation demonstrated that GPT-4 with the ReAct prompting technique can achieve an F1 score of up to 0.65 when identifying the reasons for LLMs' mistakes.
arXiv Detail & Related papers (2024-11-03T02:47:03Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter but more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate over strong baselines.
The robustness of the logical comment decoding strategy is notably higher than that of Chain-of-Thought prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z) - Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? [10.249771123421432]
We investigate whether Large Language Models (LLMs) attend to the same parts of a task description as human programmers during code generation.
We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors.
Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.
arXiv Detail & Related papers (2023-06-02T00:57:03Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format (a minimal sketch of one such perturbation appears after this list).
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
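As a concrete illustration of the ReCode-style perturbations referenced above, the sketch below applies one semantics-preserving transformation, renaming the parameters and local variables of a Python function, and prints the perturbed source. The `var_` rename scheme is an illustrative assumption; ReCode itself ships over 30 transformations spanning docstrings, names, syntax, and format.

```python
# Sketch of one ReCode-style semantics-preserving perturbation: renaming
# parameters and local variables while leaving behavior unchanged.
# Requires Python 3.9+ for ast.unparse.

import ast

class RenameNames(ast.NodeTransformer):
    """Rewrite every Name node `x` in the function body to `var_x`.
    (Toy scheme: real tooling would skip globals and builtins.)"""
    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = f"var_{node.id}"
        return node

src = "def add(a, b):\n    total = a + b\n    return total\n"
tree = ast.parse(src)
func = tree.body[0]
# Rename the parameters to match the renamed body names.
func.args.args = [ast.arg(arg=f"var_{a.arg}") for a in func.args.args]
for stmt in func.body:
    RenameNames().visit(stmt)
print(ast.unparse(tree))
# def add(var_a, var_b):
#     var_total = var_a + var_b
#     return var_total
```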