Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
- URL: http://arxiv.org/abs/2410.22240v1
- Date: Tue, 29 Oct 2024 17:05:25 GMT
- Title: Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
- Authors: Yuxuan Chen, Guangsheng Ou, Mingwei Liu, Yanlin Wang, Zibin Zheng
- Abstract summary: This study presents the first systematic exploration of decoder-only large language models for code search.
We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets, and three model sizes.
Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder.
- Score: 32.338318300589776
- License:
- Abstract: Code search is crucial for code reuse, enabling developers to efficiently locate relevant snippets. Current methods rely on encoder-based models, which suffer from limitations such as poor generalization and restricted input lengths. Decoder-only large language models (LLMs), with their extensive pre-training, larger size, and longer input capabilities, offer potential solutions to these issues, yet their effectiveness in code search remains underexplored. To fill this gap, our study presents the first systematic exploration of decoder-only LLMs for code search. We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets (CSN and CoSQA$^+$), and three model sizes. Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder, achieving a 5.57% improvement in MRR on CSN and a 49.6% increase in MAP on CoSQA$^+$ compared to zero-shot UniXcoder. These results highlight the superior performance and adaptability of decoder-only models. Additionally, we provide valuable insights into optimizing these models for code search, covering aspects such as model selection, fine-tuning methods, training data, and model size, and discussing their strengths and limitations.
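To make the retrieval setup concrete, below is a minimal sketch of using a decoder-only LLM as a bi-encoder for code search and scoring the ranking with MRR. The checkpoint name, mean pooling, and the toy query/snippet pairs are illustrative assumptions, not the paper's exact fine-tuning or evaluation pipeline.

```python
# Hypothetical sketch: a decoder-only LLM used as a bi-encoder for code search,
# scored with MRR. Model name, pooling, and data are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "google/codegemma-2b"  # assumption: any decoder-only checkpoint fits here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # [B, T, D]
    mask = batch["attention_mask"].unsqueeze(-1).float()     # [B, T, 1]
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # [B, D]
    return torch.nn.functional.normalize(emb, dim=-1)

def mean_reciprocal_rank(query_emb, code_emb):
    """MRR when the i-th query's ground-truth snippet is the i-th code entry."""
    sims = query_emb @ code_emb.T                             # [Q, C]
    ranks = (sims >= sims.diag().unsqueeze(1)).sum(dim=1)     # 1-based rank of the match
    return (1.0 / ranks.float()).mean().item()

queries = ["reverse a linked list", "read a json file into a dict"]
snippets = ["def reverse(head): ...",
            "def load(path): return json.load(open(path))"]
print("MRR:", mean_reciprocal_rank(embed(queries), embed(snippets)))
```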
Related papers
- You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
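A toy sketch of the decoder-decoder layout described above, assuming single-head attention and small dimensions: the self-decoder builds one key-value cache, and every cross-decoder layer attends to that same cache. This is an illustration of the idea, not the YOCO implementation.

```python
# Toy decoder-decoder: the self-decoder produces one global K/V cache,
# and all cross-decoder layers reuse that single cache.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDecoderLayer(nn.Module):
    """Causal self-attention layer; its K/V become the shared cache."""
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(d, d) for _ in range(4))

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.o(y), (k, v)

class CrossDecoderLayer(nn.Module):
    """Attends to the shared cache instead of building its own K/V."""
    def __init__(self, d):
        super().__init__()
        self.q, self.o = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, x, cache):
        k, v = cache
        y = F.scaled_dot_product_attention(self.q(x), k, v, is_causal=True)
        return x + self.o(y)

class ToyYOCO(nn.Module):
    def __init__(self, d=64, n_self=2, n_cross=4):
        super().__init__()
        self.self_layers = nn.ModuleList(SelfDecoderLayer(d) for _ in range(n_self))
        self.cross_layers = nn.ModuleList(CrossDecoderLayer(d) for _ in range(n_cross))

    def forward(self, x):
        cache = None
        for layer in self.self_layers:
            x, cache = layer(x)      # only the final K/V cache is kept
        for layer in self.cross_layers:
            x = layer(x, cache)      # every cross layer reuses the same K/V
        return x

print(ToyYOCO()(torch.randn(1, 10, 64)).shape)   # [batch, seq, dim]
```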
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
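A minimal sketch of a frame-reduction layer in this spirit: stack r adjacent encoder frames and project them back to the model dimension, shrinking the time axis by a factor of r. The reduction factor and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical frame-reduction layer: concatenate r adjacent encoder frames
# and project back to d_model, reducing the number of output frames by r.
import torch
import torch.nn as nn

class FrameReduction(nn.Module):
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(r * d_model, d_model)

    def forward(self, x):                      # x: [batch, frames, d_model]
        b, t, d = x.shape
        t = (t // self.r) * self.r             # drop any trailing remainder frames
        x = x[:, :t].reshape(b, t // self.r, self.r * d)
        return self.proj(x)                    # [batch, frames // r, d_model]

enc_out = torch.randn(2, 100, 256)             # e.g. 100 encoder frames
print(FrameReduction(256, r=4)(enc_out).shape) # torch.Size([2, 25, 256])
```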
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks [11.721062526796976]
This study broadly classifies machine learning models into three categories: 1) ATI models that implicitly perform affine transformations on inputs; 2) Tree-based models that are based on decision trees; and 3) the rest, such as kNN.
Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoders by learning suitable weights from the data.
We also explain why the target encoder and its variants are the most suitable encoders for tree-based models.
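A small numeric check of the mimicry argument for ATI models: a linear map over one-hot inputs reproduces target encoding exactly when each category's weight is set to that category's target mean. The categories and targets below are made-up toy data.

```python
# Numeric illustration: one-hot inputs plus a linear map can mimic target
# encoding by placing each category's target mean on the corresponding weight.
import numpy as np

categories = np.array(["a", "b", "a", "c", "b", "a"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

cats = sorted(set(categories))
one_hot = (categories[:, None] == np.array(cats)[None, :]).astype(float)  # [n, k]

# Target encoding: replace each category with the mean target of that category.
target_means = np.array([y[categories == c].mean() for c in cats])        # [k]
target_encoded = target_means[[cats.index(c) for c in categories]]

# The affine model's weights are simply the per-category means.
mimicked = one_hot @ target_means

print(np.allclose(mimicked, target_encoded))   # True
```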
arXiv Detail & Related papers (2024-01-18T02:21:53Z)
- Lego-Features: Exporting modular encoder features for streaming and deliberation ASR [34.23347991756358]
We build on work that has begun to explore building encoders with modular encoded representations.
Our framework builds on top of existing encoded representations, converting them to modular features, dubbed as Lego-Features.
Though sparse, we show that the Lego-Features are powerful when tested with RNN-T or LAS decoders.
arXiv Detail & Related papers (2023-03-31T23:33:21Z)
- Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
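A hedged sketch of this two-stage pipeline, using generic public checkpoints as placeholders rather than the models evaluated in the paper: a bi-encoder recalls a small top-k set, then a cross-encoder re-ranks those candidates.

```python
# Two-stage retrieval sketch: cheap bi-encoder recall, then cross-encoder
# re-ranking. Model names are generic placeholders, not the paper's models.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

code_corpus = [
    "def add(a, b): return a + b",
    "def read_json(path): import json; return json.load(open(path))",
    "def bubble_sort(xs): ...",
]
query = "load a json file"

# Stage 1: recall top-k candidates with embedding similarity.
corpus_emb = bi_encoder.encode(code_corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
top_k = scores.topk(k=2).indices.tolist()

# Stage 2: re-rank the recalled candidates with a fine-grained cross-encoder.
pairs = [(query, code_corpus[i]) for i in top_k]
rerank_scores = cross_encoder.predict(pairs)
best = top_k[max(range(len(pairs)), key=lambda i: rerank_scores[i])]
print(code_corpus[best])
```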
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We fine-tune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
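A rough sketch of the document-to-query scoring idea, assuming a generic T5 checkpoint as the encoder-decoder: each document is scored by how likely the model is to generate the query from it. The decoder-only inference-time decomposition itself is not reproduced here.

```python
# Re-ranking by query-generation likelihood under a generic encoder-decoder.
# The t5-small checkpoint and the toy documents are illustrative assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

def query_likelihood(document: str, query: str) -> float:
    """Return the negated mean token loss of generating `query` from `document`."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    labels = tokenizer(query, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss   # mean cross-entropy per token
    return -loss.item()                               # higher is better

docs = ["Paris is the capital of France.",
        "Bubble sort repeatedly swaps adjacent items."]
query = "what is the capital of france"
ranked = sorted(docs, key=lambda d: query_likelihood(d, query), reverse=True)
print(ranked[0])
```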
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
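As a loose analogy for such a one-to-one AST-to-sequence mapping, the sketch below flattens a Python AST into a bracketed token sequence from which the tree shape remains recoverable; it uses Python's ast module and is not UniXcoder's actual encoding.

```python
# Toy illustration: serialize an AST with explicit open/close markers so the
# tree structure stays recoverable from the flat sequence. Not UniXcoder's
# actual mapping, only an analogy for the general idea.
import ast

def flatten(node) -> list[str]:
    """Serialize an AST node into a bracketed token sequence."""
    tokens = [f"<{type(node).__name__}>"]
    for child in ast.iter_child_nodes(node):
        tokens += flatten(child)
    tokens.append(f"</{type(node).__name__}>")
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b")
print(" ".join(flatten(tree)))
# prints something like: <Module> <FunctionDef> <arguments> <arg> </arg> ... </Module>
```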
arXiv Detail & Related papers (2022-03-08T04:48:07Z)