CoSQA: 20,000+ Web Queries for Code Search and Question Answering
- URL: http://arxiv.org/abs/2105.13239v1
- Date: Thu, 27 May 2021 15:37:21 GMT
- Title: CoSQA: 20,000+ Web Queries for Code Search and Question Answering
- Authors: Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang,
Ming Zhou, Nan Duan
- Abstract summary: The CoSQA dataset includes 20,604 labels for pairs of natural language queries and code snippets.
We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching.
We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%.
- Score: 63.92224685262063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding code given a natural language query is beneficial to the
productivity of software developers. Future progress towards better semantic
matching between queries and code requires richer supervised training
resources. To remedy this, we introduce the CoSQA dataset. It includes 20,604
labels for pairs of natural language queries and code snippets, each annotated
by at least 3 human annotators. We further introduce a contrastive learning
method dubbed CoCLR to enhance query-code matching, which acts as a data
augmenter to supply additional artificially generated training instances. We
show that, evaluated on CodeXGLUE with the same CodeBERT model, training on
CoSQA improves the accuracy of code question answering by 5.1%, and
incorporating CoCLR brings a further improvement of 10.5%.
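As a rough illustration of what contrastive query-code training looks like, the following is a minimal sketch of an InfoNCE-style objective with in-batch negatives; it is not the exact CoCLR loss or its augmentation scheme, and the encoder, embedding dimension, and temperature are assumptions.

```python
# Minimal sketch: query-code contrastive loss with in-batch negatives.
# Random tensors stand in for the outputs of a CodeBERT-style encoder.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """query_emb, code_emb: (batch, dim) embeddings of paired queries and code.
    Each query's positive is its paired code; every other code in the batch
    serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.t() / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

queries = torch.randn(8, 768)  # stand-in query embeddings
codes = torch.randn(8, 768)    # stand-in code embeddings
print(contrastive_loss(queries, codes).item())
```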
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- CoSQA+: Enhancing Code Search Dataset with Matching Code [27.10957318333608]
CoSQA+ pairs high-quality queries with multiple suitable code snippets.
CoSQA+ has demonstrated superior quality over CoSQA.
We propose a new metric to assess one-to-N code search performance.
arXiv Detail & Related papers (2024-06-17T14:34:14Z)
- Prompt-based Code Completion via Multi-Retrieval Augmented Generation [15.233727939816388]
ProCC is a code completion framework that leverages prompt engineering and a contextual multi-armed bandit algorithm.
ProCC outperforms the state-of-the-art code completion technique by 8.6% on our collected open-source benchmark suite.
ProCC also allows augmenting fine-tuned techniques in a plug-and-play manner, yielding a 5.6% improvement over our studied fine-tuned model.
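As a loose illustration of bandit-driven strategy selection, here is a context-free epsilon-greedy sketch; ProCC itself is described as using a contextual bandit, and the strategy names and reward signal below are hypothetical.

```python
# Simplified sketch: epsilon-greedy bandit choosing among retrieval strategies
# for prompt construction. This is a context-free simplification; ProCC uses a
# contextual bandit, and the arms and rewards here are purely illustrative.
import random

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)                  # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["bm25_retrieval", "semantic_retrieval", "no_retrieval"])
for _ in range(100):
    arm = bandit.select()
    reward = random.random()  # in practice: did the resulting completion pass review/tests?
    bandit.update(arm, reward)
print(bandit.values)
```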
arXiv Detail & Related papers (2024-05-13T07:56:15Z)
- ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search [8.700556381819267]
We introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community.
We propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models.
arXiv Detail & Related papers (2024-03-25T12:34:33Z)
- Modular Visual Question Answering via Code Generation [134.59005611826777]
We present a framework that formulates visual question answering as modular code generation.
Our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning.
Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
arXiv Detail & Related papers (2023-06-08T17:45:14Z)
- Improving Code Search with Hard Negative Sampling Based on Fine-tuning [15.341959871682981]
We introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code.
We also introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and cross-encoder to promote the efficiency of evaluation and online serving.
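A minimal sketch of such a retrieve-then-rerank cascade is given below, with toy lexical scorers standing in for the dual encoder and cross encoder; real systems would use neural encoders such as CodeBERT, and the function names are illustrative.

```python
# Sketch of a Retriever-Ranker (RR) cascade: a cheap retriever scores the whole
# corpus, then a more expensive ranker rescores only the top-k candidates.
# Token-overlap scorers stand in for the dual encoder and cross encoder.
import re

def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def retriever_score(query: str, code: str) -> float:
    """Stand-in for a dual encoder: query and code are scored independently and cheaply."""
    q, c = set(tokens(query)), set(tokens(code))
    return len(q & c) / (len(q) or 1)

def ranker_score(query: str, code: str) -> float:
    """Stand-in for a cross encoder that jointly scores the concatenated pair."""
    joint = tokens(query + " " + code)
    return sum(joint.count(tok) for tok in set(tokens(query))) / len(joint)

def search(query: str, corpus: list[str], k: int = 10) -> list[str]:
    # Stage 1: fast retrieval over the full corpus.
    candidates = sorted(corpus, key=lambda c: retriever_score(query, c), reverse=True)[:k]
    # Stage 2: expensive reranking restricted to the k candidates.
    return sorted(candidates, key=lambda c: ranker_score(query, c), reverse=True)

corpus = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
]
print(search("read file contents", corpus, k=2))
```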
arXiv Detail & Related papers (2023-05-08T07:04:28Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and reference to semantically similar code obtained by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
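The following is a minimal sketch of the retrieve-and-prepend idea: find code similar to the unfinished fragment and place it in front of the prompt fed to a completion model. Lexical similarity via difflib stands in for ReACC's hybrid retriever, the completion model call is omitted, and helper names are illustrative.

```python
# Sketch: retrieval-augmented prompt construction for code completion.
# difflib-based lexical similarity stands in for a real hybrid retriever;
# the resulting prompt would be fed to a left-to-right code completion model.
import difflib

def retrieve_similar(unfinished_code: str, codebase: list[str], k: int = 2) -> list[str]:
    scored = [(difflib.SequenceMatcher(None, unfinished_code, snippet).ratio(), snippet)
              for snippet in codebase]
    return [snippet for _, snippet in sorted(scored, reverse=True)[:k]]

def build_prompt(unfinished_code: str, codebase: list[str]) -> str:
    retrieved = retrieve_similar(unfinished_code, codebase)
    # Retrieved snippets go first so the completion model can copy lexically
    # and reuse semantically similar code when finishing the fragment.
    return "\n\n".join(["# Retrieved context:"] + retrieved
                       + ["# Code to complete:", unfinished_code])

codebase = [
    "def load_json(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
]
print(build_prompt("def load_config(path):\n    ", codebase))
```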
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
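As a very rough sketch of what a convolutional code encoder with attention over layer outputs might look like (an interpretation of the COSEA summary above, not the paper's actual architecture; all hyperparameters are assumptions):

```python
# Illustrative PyTorch sketch: stacked 1-D convolutions over token embeddings,
# with attention weights over each layer's pooled output. This is a hedged
# interpretation of "convolution + layer-wise attention", not COSEA itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvCodeEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=128, num_layers=3, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            for _ in range(num_layers)
        ])
        self.layer_attn = nn.Linear(dim, 1)  # scores each layer's pooled output

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)                # (batch, dim, seq)
        pooled = []
        for conv in self.convs:
            x = F.relu(conv(x))
            pooled.append(x.max(dim=2).values)                   # (batch, dim) per layer
        stack = torch.stack(pooled, dim=1)                       # (batch, layers, dim)
        weights = torch.softmax(self.layer_attn(stack), dim=1)   # attention over layers
        return (weights * stack).sum(dim=1)                      # (batch, dim) code vector

encoder = ConvCodeEncoder()
print(encoder(torch.randint(0, 10000, (2, 50))).shape)           # torch.Size([2, 128])
```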