Rethinking Negative Pairs in Code Search
- URL: http://arxiv.org/abs/2310.08069v1
- Date: Thu, 12 Oct 2023 06:32:42 GMT
- Title: Rethinking Negative Pairs in Code Search
- Authors: Haochen Li, Xin Zhou, Luu Anh Tuan, Chunyan Miao
- Abstract summary: We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
- Score: 56.23857828689406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, contrastive learning has become a key component in fine-tuning code
search models for software development efficiency and effectiveness. It pulls
together positive code snippets while pushing negative samples away given
search queries. Among contrastive learning objectives, InfoNCE is the most
widely used loss function due to its better performance. However, the following
problems with the negative samples of InfoNCE may deteriorate its representation
learning: 1) the existence of false negative samples in large code corpora due
to duplications; 2) the failure to explicitly differentiate between the
potential relevance of negative samples. For example, a bubble sorting algorithm
is less ``negative'' than a file saving function for a quick sorting algorithm
query.
In this paper, we tackle the above problems by proposing a simple yet effective
Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss
function, we apply three methods to estimate the weights of negative pairs and
show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE.
Theoretically, we analyze the effects of Soft-InfoNCE on controlling the
distribution of learnt code representations and on deducing a more precise
mutual information estimation. We further discuss the superiority of the
proposed loss function over other design alternatives. Extensive experiments
demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods
under state-of-the-art code search models on a large-scale public dataset
consisting of six programming languages. Source code is available at
\url{https://github.com/Alex-HaochenLi/Soft-InfoNCE}.
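To make the loss concrete, below is a minimal sketch of how weight terms can be inserted into an in-batch InfoNCE objective for query-code retrieval. The abstract does not specify where the weights enter or how the three estimation methods work, so the per-pair weight matrix and the similarity-based estimator shown here are illustrative assumptions only; setting every weight to 1 recovers the vanilla InfoNCE loss, consistent with the stated special case.

# A minimal sketch (not the authors' exact implementation) of a weighted,
# "soft" InfoNCE loss for query-code retrieval with in-batch negatives.
# Assumption: weights multiply the negative terms in the denominator;
# setting every weight to 1 recovers vanilla InfoNCE.
import torch

def soft_info_nce(query_emb, code_emb, neg_weights=None, temperature=0.05):
    """query_emb, code_emb: (B, D) L2-normalised embeddings; positives lie on the diagonal.
    neg_weights: optional (B, B) weights for negative pairs (diagonal entries are ignored)."""
    sim = query_emb @ code_emb.t() / temperature             # (B, B) similarity matrix
    exp_sim = torch.exp(sim)
    pos = exp_sim.diag()                                     # positive-pair terms
    if neg_weights is None:
        neg_weights = torch.ones_like(sim)                   # vanilla InfoNCE as a special case
    off_diag = 1.0 - torch.eye(sim.size(0), device=sim.device)
    neg = (neg_weights * exp_sim * off_diag).sum(dim=1)      # weighted sum over in-batch negatives
    return (-torch.log(pos / (pos + neg))).mean()

# One hypothetical weight estimator (not one of the paper's three methods):
# down-weight negatives that are very similar to the query, since they are
# more likely to be false negatives caused by duplicated code.
def similarity_based_weights(query_emb, code_emb, alpha=1.0):
    with torch.no_grad():
        sim = query_emb @ code_emb.t()
        return torch.exp(-alpha * sim.clamp(min=0.0))

In this form, the weights only rescale how strongly each negative pair is pushed away, so likely false negatives and partially relevant snippets (such as the bubble-sort example above) can be repelled less aggressively than clearly irrelevant code.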
Related papers
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods propose to replay the data of experienced tasks when learning new tasks.
However, this is often infeasible in practice due to memory constraints or data privacy concerns.
As an alternative, data-free replay methods have been proposed that invert samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z) - Siamese Prototypical Contrastive Learning [24.794022951873156]
Contrastive Self-supervised Learning (CSL) is a practical solution that learns meaningful visual representations from massive data in an unsupervised manner.
In this paper, we tackle this problem by introducing a simple but effective contrastive learning framework.
The key insight is to employ siamese-style metric loss to match intra-prototype features, while increasing the distance between inter-prototype features.
arXiv Detail & Related papers (2022-08-18T13:25:30Z) - Positive-Negative Equal Contrastive Loss for Semantic Segmentation [8.664491798389662]
Previous works commonly design plug-and-play modules and structural losses to effectively extract and aggregate the global context.
We propose the Positive-Negative Equal contrastive loss (PNE loss), which increases the latent impact of the positive embedding on the anchor and treats positive and negative sample pairs equally.
We conduct comprehensive experiments and achieve state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2022-07-04T13:51:29Z) - Learning Fast Sample Re-weighting Without Reward Data [41.92662851886547]
This paper presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data.
Our experiments show that the proposed method achieves results competitive with the state of the art on label noise and long-tailed recognition.
arXiv Detail & Related papers (2021-09-07T17:30:56Z) - Neural Code Summarization: How Far Are We? [30.324396716447602]
Deep learning techniques have been exploited to automatically generate summaries for given code snippets.
In this paper, we conduct a systematic and in-depth analysis of five state-of-the-art neural source code summarization models.
arXiv Detail & Related papers (2021-07-15T04:33:59Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - Contrastive Learning with Hard Negative Samples [80.12117639845678]
We develop a new family of unsupervised sampling methods for selecting hard negative samples.
A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible.
The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead (a simplified weighting sketch appears after this list).
arXiv Detail & Related papers (2020-10-09T14:18:53Z) - SCE: Scalable Network Embedding from Sparsest Cut [20.08464038805681]
Large-scale network embedding aims to learn a latent representation for each node in an unsupervised manner.
A key to the success of such contrastive learning methods is how to draw positive and negative samples.
In this paper, we propose SCE for unsupervised network embedding only using negative samples for training.
arXiv Detail & Related papers (2020-06-30T03:18:15Z) - Reinforced Negative Sampling over Knowledge Graph for Recommendation [106.07209348727564]
We develop a new negative sampling model, Knowledge Graph Policy Network (kgPolicy), which works as a reinforcement learning agent to explore high-quality negatives.
kgPolicy navigates from the target positive interaction, adaptively receives knowledge-aware negative signals, and ultimately yields a potential negative item to train the recommender.
arXiv Detail & Related papers (2020-03-12T12:44:30Z)
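Relatedly, the hard-negative weighting idea from ``Contrastive Learning with Hard Negative Samples'' above can be sketched in a similarly compact form. The snippet below is a rough simplification rather than that paper's exact formulation (which also includes a debiasing correction); the beta hyperparameter and the softmax-based normalisation are assumptions made for illustration.

# A rough sketch of hardness-aware negative weighting inside InfoNCE:
# harder (more similar) negatives receive larger weights in the denominator.
import torch

def hard_negative_info_nce(anchor, pos, negs, beta=1.0, temperature=0.1):
    """anchor, pos: (B, D); negs: (B, K, D); all L2-normalised."""
    pos_logit = (anchor * pos).sum(dim=-1) / temperature        # (B,) positive logits
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negs)          # (B, K) anchor-negative similarities
    # Importance weights favouring hard negatives, normalised to mean 1 per anchor.
    weights = torch.softmax(beta * neg_sim, dim=-1) * neg_sim.size(-1)
    neg_term = (weights * torch.exp(neg_sim / temperature)).sum(dim=-1)
    return (-torch.log(torch.exp(pos_logit) / (torch.exp(pos_logit) + neg_term))).mean()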