OASIS: Order-Augmented Strategy for Improved Code Search
- URL: http://arxiv.org/abs/2503.08161v3
- Date: Fri, 14 Mar 2025 10:09:13 GMT
- Title: OASIS: Order-Augmented Strategy for Improved Code Search
- Authors: Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Haotian Zhang, Bin Chen, Yuqun Zhang, Jing Li
- Abstract summary: We propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
- Score: 14.15428333014365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
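For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of an InfoNCE loss with in-batch negatives. It is not the OASIS method itself: the order-based similarity labels are not specified in the abstract and are not shown here. The function names, the temperature value of 0.05, and the use of cosine similarity are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query_embs, code_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    query_embs[i] and code_embs[i] form the positive NL-code pair;
    every other code embedding in the batch acts as an in-batch negative.
    """
    losses = []
    for i, query in enumerate(query_embs):
        # Similarity of this query to every code snippet in the batch.
        logits = [cosine(query, code) / temperature for code in code_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the positive pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In this formulation every negative is treated identically: the loss only pushes the positive pair above all other pairs in the batch and says nothing about how the negatives rank against each other, which is the gap the abstract says order labels are meant to address.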
Related papers
- Code Comparison Tuning for Code Large Language Models [7.03872473285061]
We present Code Comparison Tuning (CCT), a simple and effective tuning method for code large language models (Code LLMs).
CCT integrates the concept of comparison into instruction tuning, both at the token and sequence levels.
We show that CCT surpasses instruction tuning in pass@1 scores by up to 4 points across diverse code LLMs.
arXiv Detail & Related papers (2024-03-28T03:25:23Z)
- Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning (CoIN) to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
arXiv Detail & Related papers (2024-02-17T00:09:32Z)
- Learning by Sorting: Self-supervised Learning with Group Ordering Constraints [75.89238437237445]
This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo).
It exploits the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many positive pairs have a larger distance than the negative pairs, and thus are not ordered correctly.
We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only leads to improved results compared to vanilla contrastive learning but also shows competitive performance to comparable methods in linear probing and outperforms current methods in k-NN performance.
arXiv Detail & Related papers (2023-01-05T11:17:55Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [2.2329530239800035]
We propose a novel rich-semantic function representation technique to ensure the model captures the intricate nuances of binary code. We introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Not All Negatives are Equal: Label-Aware Contrastive Loss for Fine-grained Text Classification [0.0]
We analyse the contrastive fine-tuning of pre-trained language models on two fine-grained text classification tasks.
We adaptively embed class relationships into a contrastive objective function to help differently weigh the positives and negatives.
We find that Label-aware Contrastive Loss outperforms previous contrastive methods.
arXiv Detail & Related papers (2021-09-12T04:19:17Z)
- Disentangled Contrastive Learning for Learning Robust Textual Representations [13.880693856907037]
We introduce the concept of momentum representation consistency to align features and leverage power normalization while conforming the uniformity.
Our experimental results for the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines.
arXiv Detail & Related papers (2021-04-11T03:32:49Z)
- Contrastive Learning with Adversarial Examples [79.39156814887133]
Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations.
This paper introduces a new family of adversarial examples for contrastive learning and uses these examples to define a new adversarial training algorithm for SSL, denoted CLAE.
arXiv Detail & Related papers (2020-10-22T20:45:10Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturb-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.