Related papers: CoSQA: 20,000+ Web Queries for Code Search and Question Answering

URL: http://arxiv.org/abs/2105.13239v1
Date: Thu, 27 May 2021 15:37:21 GMT
Title: CoSQA: 20,000+ Web Queries for Code Search and Question Answering
Authors: Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan
Abstract summary: CoSQA dataset includes 20,604 labels for pairs of natural language queries and codes. We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%.
Score: 63.92224685262063
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Finding code given a natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.
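
The abstract describes CoCLR only as a data augmenter that generates extra training instances. As a rough illustration of that general idea, and not the paper's actual procedure, the sketch below builds additional labeled query-code instances from a batch of matched pairs by cross-pairing each query with the other codes in the batch. The function name `in_batch_augment`, the sample batch, and the 1/0 labeling scheme are all assumptions made for this example.

```python
def in_batch_augment(pairs):
    """Build (query, code, label) triples from matched (query, code) pairs.

    Hypothetical helper: each original pair is kept with label 1, and every
    query is also paired with the codes of the other pairs with label 0.
    """
    examples = []
    for i, (query, _) in enumerate(pairs):
        for j, (_, code) in enumerate(pairs):
            label = 1 if i == j else 0
            examples.append((query, code, label))
    return examples


# Illustrative batch (made-up queries and snippets, not taken from CoSQA).
batch = [
    ("python reverse a list", "def rev(xs): return xs[::-1]"),
    ("python read json file", "def load(p): return json.load(open(p))"),
    ("python check if file exists", "def exists(p): return os.path.exists(p)"),
]

augmented = in_batch_augment(batch)
positives = [ex for ex in augmented if ex[2] == 1]
negatives = [ex for ex in augmented if ex[2] == 0]
```

A batch of n matched pairs yields n positive and n * (n - 1) negative instances under this scheme. It assumes the codes in a batch are distinct and that a code matched to one query does not also answer another; otherwise some generated negatives would be mislabeled.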

Related papers

Zero-Shot Cross-Domain Code Search without Fine-Tuning [12.905068305900356]
We propose a zero-shot, fine-tuning-free approach for cross-domain code search. CodeBridge combines query-code, query-comment, and code-code matching via PLM-based similarity scoring and sampling-based fusion. Our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively.
arXiv Detail & Related papers (2025-04-10T13:36:37Z)
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
CoSQA+: Enhancing Code Search Dataset with Matching Code [27.10957318333608]
CoSQA+ pairs high-quality queries with multiple suitable codes. CoSQA+ has demonstrated superior quality over CoSQA. We propose a new metric to assess one-to-N code search performance.
arXiv Detail & Related papers (2024-06-17T14:34:14Z)
Prompt-based Code Completion via Multi-Retrieval Augmented Generation [15.233727939816388]
ProCC is a code completion framework leveraging prompt engineering and the contextual multi-armed bandits algorithm. ProCC outperforms the state-of-the-art code completion technique by 8.6% on our collected open-source benchmark suite. ProCC also allows augmenting fine-tuned techniques in a plug-and-play manner, yielding a 5.6% improvement over our studied fine-tuned model.
arXiv Detail & Related papers (2024-05-13T07:56:15Z)
ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search [8.700556381819267]
We introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community. We propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models.
arXiv Detail & Related papers (2024-03-25T12:34:33Z)
Modular Visual Question Answering via Code Generation [134.59005611826777]
We present a framework that formulates visual question answering as modular code generation. Our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
arXiv Detail & Related papers (2023-06-08T17:45:14Z)
Improving Code Search with Hard Negative Sampling Based on Fine-tuning [15.341959871682981]
We introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code. We also introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and cross-encoder to promote the efficiency of evaluation and online serving.
arXiv Detail & Related papers (2023-05-08T07:04:28Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic. COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.