RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding
Style Transformation
- URL: http://arxiv.org/abs/2202.06043v1
- Date: Sat, 12 Feb 2022 11:27:32 GMT
- Title: RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding
Style Transformation
- Authors: Zhen Li, Guenevere (Qian) Chen, Chen Chen, Yayi Zou, Shouhuai Xu
- Abstract summary: Source code authorship attribution is an important problem often encountered in applications such as software forensics, bug fixing, and software quality analysis.
Recent studies show that current source code authorship attribution methods can be compromised by attackers exploiting adversarial examples and coding style manipulation.
We propose an innovative framework called Robust coding style Patterns Generation (RoPGen)
RoPGen essentially learns authors' unique coding style patterns that are hard for attackers to manipulate or imitate.
- Score: 14.959517725033423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code authorship attribution is an important problem often encountered
in applications such as software forensics, bug fixing, and software quality
analysis. Recent studies show that current source code authorship attribution
methods can be compromised by attackers exploiting adversarial examples and
coding style manipulation. This calls for robust solutions to the problem of
code authorship attribution. In this paper, we initiate the study on making
Deep Learning (DL)-based code authorship attribution robust. We propose an
innovative framework called Robust coding style Patterns Generation (RoPGen),
which essentially learns authors' unique coding style patterns that are hard
for attackers to manipulate or imitate. The key idea is to combine data
augmentation and gradient augmentation at the adversarial training phase. This
effectively increases the diversity of training examples, generates meaningful
perturbations to gradients of deep neural networks, and learns diversified
representations of coding styles. We evaluate the effectiveness of RoPGen using
four datasets of programs written in C, C++, and Java. Experimental results
show that RoPGen can significantly improve the robustness of DL-based code
authorship attribution, by respectively reducing 22.8% and 41.0% of the success
rate of targeted and untargeted attacks on average.
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset [0.0]
Source Code Authorship Attribution (SCAA) is crucial for software classification because it provides insights into the origin and behavior of software.
This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA, focusing on lexical, semantic, syntactic, and N-gram features.
arXiv Detail & Related papers (2024-06-28T13:04:16Z) - ChatGPT Code Detection: Techniques for Uncovering the Source of Code [0.0]
We use advanced classification techniques to differentiate between code written by humans and that generated by ChatGPT.
We employ a new approach that combines powerful embedding features (black-box) with supervised learning algorithms.
We show that untrained humans solve the same task not better than random guessing.
arXiv Detail & Related papers (2024-05-24T12:56:18Z) - Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures [4.962316236417777]
Recent model inversion attack algorithms permit adversaries to reconstruct a neural network's private and potentially sensitive training data by repeatedly querying the network.
We develop a novel network architecture that leverages sparse-coding layers to obtain superior robustness to this class of attacks.
arXiv Detail & Related papers (2024-03-21T18:26:23Z) - Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z) - Improving Deep Representation Learning via Auxiliary Learnable Target Coding [69.79343510578877]
This paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning.
Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations.
arXiv Detail & Related papers (2023-05-30T01:38:54Z) - An Unbiased Transformer Source Code Learning with Semantic Vulnerability
Graph [3.3598755777055374]
Current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or providing developers with code vulnerability and classification.
To address these issues, we propose a joint multitasked unbiased vulnerability classifier comprising a transformer "RoBERTa" and graph convolution neural network (GCN)
We present a training process utilizing a semantic vulnerability graph (SVG) representation from source code, created by integrating edges from a sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF)
arXiv Detail & Related papers (2023-04-17T20:54:14Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.