CodeQA: A Question Answering Dataset for Source Code Comprehension
- URL: http://arxiv.org/abs/2109.08365v1
- Date: Fri, 17 Sep 2021 06:06:38 GMT
- Title: CodeQA: A Question Answering Dataset for Source Code Comprehension
- Authors: Chenxiao Liu, Xiaojun Wan
- Abstract summary: Given a code snippet and a question, the task is to generate a textual answer.
CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose CodeQA, a free-form question answering dataset for the purpose of
source code comprehension: given a code snippet and a question, a textual
answer is required to be generated. CodeQA contains a Java dataset with 119,778
question-answer pairs and a Python dataset with 70,085 question-answer pairs.
To obtain natural and faithful questions and answers, we implement syntactic
rules and semantic analysis to transform code comments into question-answer
pairs. We present the construction process and conduct a systematic analysis of
our dataset. We report and discuss results achieved by several neural baselines
on our dataset. While research on question answering and machine reading
comprehension is developing rapidly, little prior work has drawn attention to
code question answering. This new dataset can serve as a useful research
benchmark for source code comprehension.
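The abstract describes turning code comments into question-answer pairs via syntactic rules. As a minimal illustrative sketch (not the authors' implementation — the paper combines a richer rule set with semantic analysis), a single such rule might rewrite a declarative "Returns X" comment into a question and answer:

```python
def comment_to_qa(comment: str) -> tuple[str, str]:
    """Transform a 'Returns X' style comment into a (question, answer) pair.

    This single rule and its coverage are hypothetical; CodeQA's actual
    construction uses multiple syntactic rules plus semantic analysis.
    """
    cleaned = comment.strip().rstrip(".")
    if cleaned.lower().startswith("returns "):
        # "Returns the index of the first match" ->
        # Q: "What does the function return?"  A: "the index of the first match"
        answer = cleaned[len("Returns "):]
        return "What does the function return?", answer
    # Fallback: ask a generic question and answer with the comment itself.
    return "What does the code do?", cleaned

q, a = comment_to_qa("Returns the index of the first match.")
```

Rules of this shape yield "natural and faithful" pairs because both question and answer are anchored in a comment a developer actually wrote.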
Related papers
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents work on using large language models for question answering over Python source code.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
arXiv Detail & Related papers (2024-11-05T11:25:12Z) - PCoQA: Persian Conversational Question Answering Dataset [12.07607688189035]
The PCoQA dataset is a resource comprising information-seeking dialogs encompassing a total of 9,026 contextually-driven questions.
PCoQA is designed to present novel challenges compared to previous question answering datasets.
This paper not only presents the comprehensive PCoQA dataset but also reports the performance of various benchmark models.
arXiv Detail & Related papers (2023-12-07T15:29:34Z) - Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z) - CS1QA: A Dataset for Assisting Code-based Question Answering in an
Introductory Programming Course [13.61096948994569]
CS1QA consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python.
Each question is accompanied by the student's code and the portion of the code relevant to answering the question.
arXiv Detail & Related papers (2022-10-26T05:40:34Z) - Modern Question Answering Datasets and Benchmarks: A Survey [5.026863544662493]
Question Answering (QA) is one of the most important natural language processing (NLP) tasks.
It aims to use NLP technologies to generate a corresponding answer to a given question based on a massive unstructured corpus.
In this paper, we investigate influential QA datasets that have been released in the era of deep learning.
arXiv Detail & Related papers (2022-06-30T05:53:56Z) - ComQA: Compositional Question Answering via Hierarchical Graph Neural
Networks [47.12013005600986]
We present a large-scale compositional question answering dataset containing more than 120k human-labeled questions.
To tackle the ComQA problem, we propose a hierarchical graph neural network, which represents the document from low-level words to high-level sentences.
Our proposed model achieves a significant improvement over previous machine reading comprehension methods and pre-training methods.
arXiv Detail & Related papers (2021-01-16T08:23:27Z) - Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z) - Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z) - Tell Me How to Ask Again: Question Data Augmentation with Controllable
Rewriting in Continuous Space [94.8320535537798]
We propose Controllable Rewriting based Question Data Augmentation (CRQDA) for machine reading comprehension (MRC), question generation, and question-answering natural language inference tasks.
We treat the question data augmentation task as a constrained question rewriting problem to generate context-relevant, high-quality, and diverse question data samples.
arXiv Detail & Related papers (2020-10-04T03:13:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.