CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs
- URL: http://arxiv.org/abs/2411.14611v1
- Date: Thu, 21 Nov 2024 22:24:47 GMT
- Title: CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs
- Authors: Alex Mathai, Kranthi Sedamaki, Debeshee Das, Noble Saji Mathews, Srikanth Tamilselvam, Sridhar Chimalakonda, Atul Kumar
- Abstract summary: We propose CodeSAM, a novel framework to infuse multiple code-views into transformer-based models by creating self-attention masks.
We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification.
- Score: 8.850533100643547
- License:
- Abstract: Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that this technique improves downstream performance compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks, whether individual code-views or a combination of code-views are used during fine-tuning. We believe these results indicate that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource-constrained settings.
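The abstract describes the masking mechanism only at a high level. Below is a minimal illustrative sketch, not the authors' implementation, of the general idea: edges from one or more code-view graphs (such as an AST or DFG) over token positions are unioned into a boolean self-attention mask, and disallowed token pairs are suppressed in the attention scores. The function names, the undirected treatment of edges, and the union strategy for combining views are assumptions made for illustration.

```python
# Illustrative sketch: turning multiple code-view graphs (e.g. AST and DFG edges
# over code-token positions) into a single self-attention mask and applying it
# to attention scores. Names and the combination strategy are assumptions.
import torch

def views_to_attention_mask(num_tokens: int,
                            code_views: list[list[tuple[int, int]]]) -> torch.Tensor:
    """Union the edges of several code-view graphs into one boolean mask.

    Each code view is a list of (i, j) edges between token positions.
    mask[i, j] == True means token i may attend to token j.
    """
    mask = torch.eye(num_tokens, dtype=torch.bool)  # every token attends to itself
    for edges in code_views:
        for i, j in edges:
            mask[i, j] = True
            mask[j, i] = True  # treat edges as undirected for attention
    return mask

def masked_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with disallowed token pairs set to -inf."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 6 code tokens, an AST-like view and a DFG-like view (made-up edges).
ast_edges = [(0, 1), (1, 2), (2, 3)]
dfg_edges = [(1, 4), (4, 5)]
mask = views_to_attention_mask(6, [ast_edges, dfg_edges])
q = k = v = torch.randn(6, 16)
out = masked_self_attention(q, k, v, mask)
print(out.shape)  # torch.Size([6, 16])
```

How CodeSAM actually combines views, handles directed edges, and treats special and padding tokens is defined in the paper itself; the sketch only conveys the shape of the approach.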
Related papers
- An Empirical Study on Capability of Large Language Models in Understanding Code Semantics [4.638578225024275]
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks.
This paper introduces EMPICA, a framework designed to evaluate the capabilities of code LLMs in understanding code semantics.
arXiv Detail & Related papers (2024-07-04T03:40:58Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z) - Contrastive Learning for Source Code with Structural and Functional
Properties [66.10710134948478]
We present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original one.
We train our model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes distinct code further apart.
arXiv Detail & Related papers (2021-10-08T02:56:43Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
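The GraphCodeBERT entry above, like the CodeSAM abstract, builds on a data-flow code-view that records "where-the-value-comes-from" relations between variables. The snippet below is a rough sketch of how such edges could be collected for straight-line Python assignments by remembering the most recent definition of each variable; it is not the extraction pipeline used by GraphCodeBERT or CodeSAM, and the `DataFlowEdge` structure is a made-up illustration. Real extractors also handle control flow, scopes, and aliasing.

```python
# Rough sketch: collecting "where-the-value-comes-from" edges for straight-line
# Python code by remembering the most recent assignment of each variable.
# Illustrative only; not the actual DFG extraction used by the cited papers.
import ast
from dataclasses import dataclass

@dataclass
class DataFlowEdge:
    use_var: str   # variable being read
    use_line: int  # line where it is read
    def_line: int  # line where its current value was assigned

def extract_dataflow(source: str) -> list[DataFlowEdge]:
    last_def: dict[str, int] = {}
    edges: list[DataFlowEdge] = []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            # record reads on the right-hand side first
            for node in ast.walk(stmt.value):
                if isinstance(node, ast.Name) and node.id in last_def:
                    edges.append(DataFlowEdge(node.id, stmt.lineno, last_def[node.id]))
            # then record the new definition(s) on the left-hand side
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    last_def[target.id] = stmt.lineno
    return edges

print(extract_dataflow("x = 1\ny = x + 2\nx = y * x\n"))
# Prints three edges: x read on line 2 comes from line 1, y read on line 3 comes
# from line 2, and x read on line 3 comes from its definition on line 1.
```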