Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
- URL: http://arxiv.org/abs/2410.22240v1
- Date: Tue, 29 Oct 2024 17:05:25 GMT
- Title: Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
- Authors: Yuxuan Chen, Guangsheng Ou, Mingwei Liu, Yanlin Wang, Zibin Zheng,
- Abstract summary: This study presents the first systematic exploration of decoder-only large language models for code search.
We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets, and three model sizes.
Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder.
- Score: 32.338318300589776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code search is crucial for code reuse, enabling developers to efficiently locate relevant snippets. Current methods rely on encoder-based models, which suffer from limitations such as poor generalization and restricted input lengths. Decoder-only large language models (LLMs), with their extensive pre-training, larger size, and longer input capabilities, offer potential solutions to these issues, yet their effectiveness in code search remains underexplored. To fill this gap, our study presents the first systematic exploration of decoder-only LLMs for code search. We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets (CSN and CoSQA$^+$), and three model sizes. Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder, achieving a 5.57% improvement in MRR on CSN and a 49.6% increase in MAP on CoSQA$^+$ compared to zero-shot UniXcoder. These results highlight the superior performance and adaptability of decoder-only models. Additionally, we provide valuable insights into optimizing these models for code search, covering aspects such as model selection, fine-tuning methods, training data, and model size, and discussing their strengths and limitations.
Related papers
- Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model [30.945523139748634]
We revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM)<n>We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales.<n>Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance.
arXiv Detail & Related papers (2025-10-30T15:48:28Z) - Seq vs Seq: An Open Suite of Paired Encoders and Decoders [37.62535961965971]
We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion.<n>Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes.<n>We show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective.
arXiv Detail & Related papers (2025-07-15T15:31:51Z) - Leveraging Decoder Architectures for Learned Sparse Retrieval [26.483483554222012]
Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures.
This study investigates the effectiveness of LSR across different transformer-based architectures.
arXiv Detail & Related papers (2025-04-25T08:04:52Z) - Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation [52.19855651708349]
We study a novel problem: adapting decoder-only large language models to encoder-decoder models.
We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation.
Under similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterpart.
arXiv Detail & Related papers (2025-04-08T17:13:41Z) - Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM [43.77512279007385]
Ling-Coder-Lite is a code large language model with comprehensive performance yet ultimate efficiency.<n>We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods.<n>Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size.
arXiv Detail & Related papers (2025-03-22T15:00:18Z) - D2LLM: Decomposed and Distilled Large Language Models for Semantic Search [18.63768158439252]
We present D2LLMs-Decomposed and Distilled LLMs for semantic search.
We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module.
Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks.
arXiv Detail & Related papers (2024-06-25T04:03:04Z) - Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks [4.851704512420683]
We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English.<n>Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters.
arXiv Detail & Related papers (2024-06-19T11:50:09Z) - You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
arXiv Detail & Related papers (2024-05-08T17:57:39Z) - Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z) - Comparative Study on the Performance of Categorical Variable Encoders in
Classification and Regression Tasks [11.721062526796976]
This study broadly classifies machine learning models into three categories: 1) ATI models that implicitly perform affine transformations on inputs; 2) Tree-based models that are based on decision trees; and 3) the rest, such as kNN.
Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoders by learning suitable weights from the data.
We also explain why the target encoder and its variants are the most suitable encoders for tree-based models.
arXiv Detail & Related papers (2024-01-18T02:21:53Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Speculative Contrastive Decoding [55.378200871224074]
Large language models(LLMs) exhibit exceptional performance in language tasks, yet their auto-regressive inference is limited due to high computational requirements and is sub-optimal due to the exposure bias.
Inspired by speculative decoding and contrastive decoding, we introduce Speculative Contrastive Decoding(SCD), a straightforward yet powerful decoding approach.
arXiv Detail & Related papers (2023-11-15T14:15:30Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Improving Code Search with Hard Negative Sampling Based on Fine-tuning [15.341959871682981]
We introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code.
We also introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and cross-encoder to promote the efficiency of evaluation and online serving.
arXiv Detail & Related papers (2023-05-08T07:04:28Z) - Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most of the seq2seq task is resolved by an encoder to encode the source sequence and a decoder to generate the target text.
Recently, a bunch of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z) - Lego-Features: Exporting modular encoder features for streaming and
deliberation ASR [34.23347991756358]
We build on work that has begun to explore building encoders with modular encoded representations.
Our framework builds on top of existing encoded representations, converting them to modular features, dubbed as Lego-Features.
Though sparse, we show that the Lego-Features are powerful when tested with RNN-T or LAS decoders.
arXiv Detail & Related papers (2023-03-31T23:33:21Z) - Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z) - Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking
Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model using in the form of document to query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.