How to get better embeddings with code pre-trained models? An empirical
study
- URL: http://arxiv.org/abs/2311.08066v1
- Date: Tue, 14 Nov 2023 10:44:21 GMT
- Title: How to get better embeddings with code pre-trained models? An empirical
study
- Authors: Yu Zhao and Lina Gong and Haoxiang Zhang and Yaoshen Yu and Zhiqiu
Huang
- Abstract summary: We study five different code pre-trained models (PTMs) to generate embeddings for downstream classification tasks.
We find that embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet.
The quality of code embeddings obtained by combining code data and text data in the same way as when pre-training the PTMs is poor and cannot guarantee richer semantic information.
- Score: 6.220333404184779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have demonstrated powerful capabilities in the
field of natural language processing (NLP). Recently, code pre-trained models
(PTMs), which draw on experience from the NLP field, have also achieved
state-of-the-art results in many software engineering (SE) downstream tasks.
These code PTMs take into account the differences between programming languages
and natural languages during pre-training and make adjustments to pre-training
tasks and input data. However, researchers in the SE community still inherit
habits from the NLP field when using these code PTMs to generate embeddings for
SE downstream classification tasks, such as generating semantic embeddings for
code snippets through special tokens and inputting code and text information in
the same way as when pre-training the PTMs. In this paper, we empirically study five
different PTMs (i.e. CodeBERT, CodeT5, PLBART, CodeGPT and CodeGen) with three
different architectures (i.e. encoder-only, decoder-only and encoder-decoder)
on four SE downstream classification tasks (i.e. code vulnerability detection,
code clone detection, just-in-time defect prediction and function docstring
mismatch detection) with respect to the two aforementioned aspects. Our
experimental results indicate that (1) regardless of the architecture of the
code PTMs used, embeddings obtained through special tokens do not sufficiently
aggregate the semantic information of the entire code snippet; (2) the quality
of code embeddings obtained by combining code data and text data in the same way
as when pre-training the PTMs is poor and cannot guarantee richer semantic
information; (3) when using the method that aggregates the vector representations
of all code tokens, the decoder-only PTMs can obtain code embeddings whose
semantics are as rich as, or even richer than, those obtained from the
encoder-only and encoder-decoder PTMs.
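The comparison above hinges on how a fixed-size embedding is extracted from a PTM: taking the hidden state of a special token versus aggregating the hidden states of all code tokens. The sketch below illustrates both strategies for an encoder-only model. It is a minimal illustration assuming the Hugging Face transformers library and the public microsoft/codebert-base checkpoint, not the authors' exact experimental pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): two ways to turn a code
# snippet into a fixed-size embedding with an encoder-only code PTM.
# Assumes the Hugging Face `transformers` library and the public
# `microsoft/codebert-base` checkpoint; both choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Strategy 1: use the special-token vector (<s>, the RoBERTa-style [CLS]) as the
# snippet embedding -- the habit inherited from NLP that the paper questions.
special_token_embedding = hidden[:, 0, :]

# Strategy 2: aggregate the vectors of all code tokens (mean pooling over
# non-padding positions) -- the aggregation method the paper finds preserves
# richer snippet-level semantics.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
mean_pooled_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(special_token_embedding.shape, mean_pooled_embedding.shape)
```

Either vector can then be fed to a downstream classifier (e.g., for clone detection or just-in-time defect prediction); per finding (1), the special-token vector alone does not sufficiently aggregate the semantics of the whole snippet, so the pooled variant is the more informative feature.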
Related papers
- Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? [30.84647604639891]
We investigate the effects of code embedding generated by ten different code PTMs on the performance of vulnerability detection.
We propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks.
arXiv Detail & Related papers (2024-08-09T04:56:26Z) - ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization [21.886950861445122]
Code summarization aims to automatically generate succinct natural language summaries for given code snippets.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks.
arXiv Detail & Related papers (2024-07-01T03:06:51Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - On Leveraging Encoder-only Pre-trained Language Models for Effective
Keyphrase Generation [76.52997424694767]
This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG)
With encoder-only PLMs, although keyphrase extraction (KPE) with Conditional Random Fields slightly excels at identifying present keyphrases, the KPG formulation yields a broader spectrum of keyphrase predictions.
We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures with encoder-only PLMs.
arXiv Detail & Related papers (2024-02-21T18:57:54Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Soft-Labeled Contrastive Pre-training for Function-level Code
Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
By considering the relevance between code snippets in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z) - An Exploratory Study on Code Attention in BERT [8.488193857572211]
We investigate the attention behavior of PLM on code and compare it with natural language.
We show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the tokens most attended to in NLP.
The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP.
arXiv Detail & Related papers (2022-04-05T21:23:10Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.