Studying Vulnerable Code Entities in R
- URL: http://arxiv.org/abs/2402.04421v1
- Date: Tue, 6 Feb 2024 21:39:55 GMT
- Title: Studying Vulnerable Code Entities in R
- Authors: Zixiao Zhao, Millon Madhur Das, Fatemeh H. Fard
- Abstract summary: This study investigates the vulnerability of Code-PLMs for code entities in R.
CodeAttack is a black-box attack model that uses the structure of code to generate adversarial code samples.
Results show that the most vulnerable code entity is the identifier, followed by some syntax tokens specific to R.
- Score: 2.225268436173329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained Code Language Models (Code-PLMs) have shown many advancements and
achieved state-of-the-art results for many software engineering tasks in the
past few years. These models are mainly targeted for popular programming
languages such as Java and Python, leaving out many other ones like R. Though R
has a wide community of developers and users, there is little known about the
applicability of Code-PLMs for R. In this preliminary study, we aim to
investigate the vulnerability of Code-PLMs for code entities in R. For this
purpose, we use an R dataset of code and comment pairs and then apply
CodeAttack, a black-box attack model that uses the structure of code to
generate adversarial code samples. We investigate how the model can attack
different entities in R. This is the first step towards understanding the
importance of R token types, compared to popular programming languages (e.g.,
Java). We limit our study to code summarization. Our results show that the most
vulnerable code entity is the identifier, followed by some syntax tokens
specific to R. The results can shed light on the importance of token types and
help in developing models for code summarization and method name prediction for
the R language.
Related papers
- Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval? [0.0]
Benefits of better code search include faster new developer on-boarding, reduced software maintenance, and ease of understanding for large repositories.<n>Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind.<n>We study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results.
arXiv Detail & Related papers (2025-09-30T22:31:36Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Retrieval-Augmented Code Review Comment Generation [0.0]
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes.<n>Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval-based (IR), reusing comments from similar past examples.<n>This work proposes to leverage retrieval-augmented generation (RAG) for RCG by conditioning pretrained language models on retrieved code-review exemplars.
arXiv Detail & Related papers (2025-06-13T08:58:20Z) - Is Compression Really Linear with Code Intelligence? [60.123628177110206]
textitFormat Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably.<n>Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC)<n>Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z) - ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox.
Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse.
In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z) - Resource-Efficient & Effective Code Summarization [3.512140256677132]
GreenAI techniques, such as QLoRA, offer a promising path for dealing with large models' sustainability.
Our study evaluates two state-of-the-art CLMs across two programming languages: Python and Java.
Results show that QLoRA enables efficient fine-tuning of CLMs for code summarization.
arXiv Detail & Related papers (2025-02-05T21:06:30Z) - Do Current Language Models Support Code Intelligence for R Programming Language? [2.225268436173329]
We evaluate Code-PLMs for two tasks of code summarization and method name prediction using several settings and strategies.
Our results demonstrate that the studied models have experienced varying degrees of performance degradation.
The dual syntax paradigms in R significantly impact the models' performance, particularly in code summarization tasks.
arXiv Detail & Related papers (2024-10-10T10:23:23Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM)
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The robustness of the logical comment decoding strategy is notably higher than the Chain-of-thoughts prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z) - Traces of Memorisation in Large Language Models for Code [16.125924759649106]
Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet.
We compare the rate of memorisation with large language models trained on natural language.
We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts.
arXiv Detail & Related papers (2023-12-18T19:12:58Z) - Leveraging Generative AI: Improving Software Metadata Classification
with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful"
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained
Model [23.947178895479464]
We propose CLSEBERT, a Constrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST)
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z) - InferCode: Self-Supervised Learning of Code Representations by
Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome the limitation by adapting the self-language learning mechanism to build source code model.
Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model.
arXiv Detail & Related papers (2020-12-13T10:33:41Z) - COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.