A Simple, Yet Effective Approach to Finding Biases in Code Generation
- URL: http://arxiv.org/abs/2211.00609v2
- Date: Tue, 9 May 2023 14:47:24 GMT
- Title: A Simple, Yet Effective Approach to Finding Biases in Code Generation
- Authors: Spyridon Mouselinos, Mateusz Malinowski, Henryk Michalewski
- Abstract summary: This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones.
We propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges.
- Score: 16.094062131137722
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, high-performing code generation systems based on large language
models have surfaced. They are trained on massive corpora containing much more
natural text than actual executable computer code. This work shows that current
code generation systems exhibit undesired biases inherited from their large
language model backbones, which can reduce the quality of the generated code
under specific circumstances.
To investigate the effect, we propose the "block of influence" concept, which
enables a modular decomposition and analysis of the coding challenges. We
introduce an automated intervention mechanism reminiscent of adversarial
testing that exposes undesired biases through the failure modes of the models
under test. Finally, we demonstrate how our framework can be used as a data
transformation technique during fine-tuning, acting as a mitigation strategy
for these biases.
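As an illustration of the intervention mechanism, the sketch below perturbs a single block of influence, the entry-point name, of a docstring-style coding challenge while leaving the specification intact. This is a minimal sketch, not code from the paper: the prompt, the rename rule, and the two placeholder hooks are assumptions for illustration.

```python
import re

# Hypothetical docstring-style challenge; the docstring alone fully
# specifies the task, so renaming the function should not change it.
PROMPT = '''def prime_sum(numbers):
    """Return the sum of all prime numbers in the list.

    >>> prime_sum([2, 3, 4, 5])
    10
    """
'''

def rename_entry_point(prompt: str, new_name: str) -> str:
    """Intervention on one block of influence: swap the function name
    for a neutral or deliberately misleading one."""
    return re.sub(r"def\s+\w+\(", f"def {new_name}(", prompt, count=1)

def generate_code(prompt: str) -> str:
    """Placeholder for the code generation model under test."""
    raise NotImplementedError("plug in a model or API call here")

def passes_tests(completion: str) -> bool:
    """Placeholder: run the completion against the task's unit tests."""
    raise NotImplementedError

# A drop in pass rate under a semantics-preserving rename signals that
# the model leans on the name rather than on the specification.
for name in ("prime_sum", "f", "sort_list", "reverse_string"):
    perturbed = rename_entry_point(PROMPT, name)
    # passed = passes_tests(generate_code(perturbed))
```

Applied to training data rather than test prompts, the same transformation yields the fine-tuning mitigation the abstract mentions: the model sees many surface variants of each task and learns to rely on the specification.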
Related papers
- Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [12.959392500354223]
We pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks.
We introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models.
arXiv Detail & Related papers (2024-06-18T06:52:14Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders with a scheme that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning, with hard negatives and hard positives constructed in an unsupervised manner; a minimal sketch of the contrastive step follows this entry.
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
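The contrastive stage described above can be made concrete with a generic InfoNCE-style loss. This is a minimal sketch under assumptions, not the paper's code: the shapes, the temperature, and the random embeddings standing in for encoder outputs are invented for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.05):
    """Pull each anchor toward its positive; push it from K hard negatives.
    anchor, positive: (B, D); negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)         # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives)     # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    target = torch.zeros(anchor.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, target)

# Random embeddings stand in for encoder outputs over code snippets.
B, K, D = 8, 4, 256
print(info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D)))
```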
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
- Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System [73.52878118434147]
We present methods to reverse-engineer the decoding method used to generate text.
Our ability to discover which decoding strategy was used has implications for detecting generated text.
arXiv Detail & Related papers (2023-09-09T18:19:47Z)
- Robust Graph Representation Learning via Predictive Coding [46.22695915912123]
Predictive coding is a message-passing framework initially developed to model information processing in the brain.
In this work, we build models that rely on the message-passing rule of predictive coding.
We show that the proposed models are comparable to standard ones in terms of performance in both inductive and transductive tasks.
arXiv Detail & Related papers (2022-12-09T03:58:22Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives; a toy sketch of the unit-test-as-reward idea follows this entry.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
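The core CodeRL idea, using unit-test outcomes as the reward signal for a generator, can be seen in a toy REINFORCE loop. This is a deliberately minimal sketch, not CodeRL's actor-critic training: the three-program "vocabulary", the task, and the single test are invented for illustration.

```python
import torch

# Tiny "vocabulary" of candidate bodies for: def add(a, b): ...
VOCAB = ["return a + b", "return a - b", "return a * b"]

def run_unit_test(body: str) -> float:
    """Functional-correctness reward: 1.0 iff the program passes the test."""
    env = {}
    exec(f"def add(a, b):\n    {body}", env)
    return 1.0 if env["add"](2, 3) == 5 else 0.0

logits = torch.zeros(len(VOCAB), requires_grad=True)  # the "policy"
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = run_unit_test(VOCAB[action])
    loss = -dist.log_prob(action) * reward  # REINFORCE: reinforce passing programs
    opt.zero_grad()
    loss.backward()
    opt.step()

print(VOCAB[logits.argmax()])  # converges to "return a + b"
```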
- On Adversarial Robustness of Synthetic Code Generation [1.2559148369195197]
This paper showcases the existence of significant dataset bias through different classes of adversarial examples.
We propose several dataset augmentation techniques to reduce bias and showcase their efficacy.
arXiv Detail & Related papers (2021-06-22T09:37:48Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges.
Recent models such as GPT-Neo pass approximately 15% of the test cases on introductory problems; a sketch of this per-test pass-rate metric follows this entry.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
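APPS-style evaluation reports, among other things, the fraction of test cases a candidate program passes. A minimal sketch of that per-test pass rate follows; the harness details and the toy tests are assumptions, not the official APPS evaluation code.

```python
def pass_rate(candidate, test_cases):
    """Fraction of input/output pairs a candidate solution gets right."""
    passed = 0
    for inputs, expected in test_cases:
        try:
            if candidate(*inputs) == expected:
                passed += 1
        except Exception:
            pass  # crashes count as failures
    return passed / len(test_cases)

# Toy check: a partially correct solution earns partial credit.
tests = [((2,), 4), ((3,), 9), ((-2,), 4)]
print(pass_rate(lambda x: x * x, tests))       # 1.0
print(pass_rate(lambda x: x * abs(x), tests))  # ~0.67 (fails the negative case)
```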
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)