CodeQueries: A Dataset of Semantic Queries over Code
- URL: http://arxiv.org/abs/2209.08372v2
- Date: Fri, 14 Jul 2023 11:01:45 GMT
- Title: CodeQueries: A Dataset of Semantic Queries over Code
- Authors: Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya
Kanade, Petros Maniatis, Shirish Shevade
- Abstract summary: We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code.
Compared to existing datasets, the queries in CodeQueries concern code semantics, the context is file-level, and the answers are code spans.
We evaluate a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries.
- Score: 7.0864879068510005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developers often have questions about semantic aspects of code they are
working on, e.g., "Is there a class whose parent classes declare a conflicting
attribute?". Answering them requires understanding code semantics such as
attributes and inheritance relation of classes. An answer to such a question
should identify code spans constituting the answer (e.g., the declaration of
the subclass) as well as supporting facts (e.g., the definitions of the
conflicting attributes). The existing work on question-answering over code has
considered yes/no questions or method-level context. We contribute a labeled
dataset, called CodeQueries, of semantic queries over Python code. Compared to
existing datasets, the queries in CodeQueries concern code semantics, the
context is file-level, and the answers are code spans. We curate the dataset
based on queries supported by a widely-used static analysis tool, CodeQL, and
include both positive and negative examples, as well as queries requiring
single-hop and multi-hop reasoning.
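The example query above ("a class whose parent classes declare a conflicting attribute") can be approximated with a small static check. The sketch below is a minimal illustration using Python's `ast` module, not CodeQL's actual analysis; it only considers direct base classes and simple class-level assignments:

```python
import ast

def conflicting_attribute_classes(source: str):
    """Find classes that inherit the same attribute name from two
    or more direct base classes (a 'conflicting attribute')."""
    tree = ast.parse(source)
    class_attrs = {}   # class name -> attribute names it assigns
    class_bases = {}   # class name -> direct base-class names
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            attrs = set()
            for stmt in node.body:
                if isinstance(stmt, ast.Assign):
                    for target in stmt.targets:
                        if isinstance(target, ast.Name):
                            attrs.add(target.id)
            class_attrs[node.name] = attrs
            class_bases[node.name] = [
                base.id for base in node.bases if isinstance(base, ast.Name)
            ]
    conflicts = []
    for name, bases in class_bases.items():
        seen = {}
        for base in bases:
            for attr in sorted(class_attrs.get(base, ())):
                if attr in seen:
                    conflicts.append((name, attr))
                else:
                    seen[attr] = base
    return conflicts

source = """
class A:
    color = "red"

class B:
    color = "blue"

class C(A, B):
    pass
"""
print(conflicting_attribute_classes(source))  # [('C', 'color')]
```

In the dataset's terms, the declaration of `C` would be the answer span, and the two `color` assignments would be the supporting facts.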
To assess the value of our dataset, we evaluate baseline neural approaches.
We study a large language model (GPT3.5-Turbo) in zero-shot and few-shot
settings on a subset of CodeQueries. We also evaluate a BERT-style model
(CuBERT) with fine-tuning. We find that these models achieve limited success on
CodeQueries. CodeQueries is thus a challenging dataset for testing the ability
of neural models to understand code semantics in the extractive
question-answering setting.
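A zero-shot evaluation of the kind described can be sketched as assembling a prompt that pairs a query with file-level code and asks for answer spans. The prompt wording and the function name below are illustrative assumptions, not the paper's actual template:

```python
def build_span_extraction_prompt(query: str, code: str) -> str:
    """Assemble a zero-shot prompt asking a language model to return
    answer spans for a semantic query over a code file.
    The wording is an illustrative assumption, not the paper's template."""
    return (
        "You are given a Python file and a semantic query about it.\n"
        "Return the code spans that answer the query, or 'N/A' if none.\n\n"
        f"Query: {query}\n\n"
        f"Code:\n{code}\n\n"
        "Answer spans:"
    )

prompt = build_span_extraction_prompt(
    "Conflicting attributes in base classes",
    "class A:\n    x = 1\nclass B:\n    x = 2\nclass C(A, B):\n    pass\n",
)
print(prompt.splitlines()[0])
```

A few-shot variant would prepend worked examples (query, code, gold spans) before the final query in the same format.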
Related papers
- Semantic Parsing for Conversational Question Answering over Knowledge
Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the
human retrieval process of sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
arXiv Detail & Related papers (2022-12-20T23:49:37Z)
- Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities [34.27541293716398]
We extensively analyze seven code models to investigate how code models represent code syntax and semantics.
We have developed four probing tasks to evaluate the models' abilities to learn code syntax and semantics.
Our results emphasize the strengths and weaknesses of various code models in mastering code syntax and semantics.
arXiv Detail & Related papers (2022-12-20T06:15:17Z)
- Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z)
- NS3: Neuro-Symbolic Semantic Code Search [33.583344165521645]
We use a Neural Module Network architecture to implement this idea.
We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods.
We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
arXiv Detail & Related papers (2022-05-21T20:55:57Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Text Classification for Task-based Source Code Related Questions [0.0]
StackOverflow provides solutions as small snippets that give a complete answer to the task a developer wants to code.
We develop a two-fold deep learning model: a Seq2Seq model and a binary classifier that take a natural-language intent and Python code snippets as input.
We find that the hidden state layer's embeddings perform slightly better than regular standard embeddings from a constructed vocabulary.
arXiv Detail & Related papers (2021-10-31T20:10:21Z)
- CodeQA: A Question Answering Dataset for Source Code Comprehension [82.63394952538292]
Given a code snippet and a question, a textual answer is required to be generated.
CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs.
arXiv Detail & Related papers (2021-09-17T06:06:38Z)
- Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning for
Semantic Code Search [22.9351865820122]
We propose MuCoS, a multi-model ensemble learning architecture for semantic code search.
We train the individual learners on different datasets which contain different perspectives of code information.
Then we ensemble the learners to capture comprehensive features of code snippets.
arXiv Detail & Related papers (2021-07-10T06:40:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.