CoDesc: A Large Code-Description Parallel Dataset
- URL: http://arxiv.org/abs/2105.14220v1
- Date: Sat, 29 May 2021 05:40:08 GMT
- Title: CoDesc: A Large Code-Description Parallel Dataset
- Authors: Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed
Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal,
Rifat Shahriyar
- Abstract summary: We present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions.
With extensive analysis, we identify and remove prevailing noise patterns from the dataset.
We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization.
- Score: 4.828053113572208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translation between natural language and source code can help software
development by enabling developers to comprehend, ideate, search, and write
computer programs in natural language. Despite growing interest from the
industry and the research community, this task is often difficult due to the
lack of large standard datasets suitable for training deep neural models,
standard noise removal methods, and evaluation benchmarks. This leaves
researchers to collect new small-scale datasets, resulting in inconsistencies
across published works. In this study, we present CoDesc -- a large parallel
dataset composed of 4.2 million Java methods and natural language descriptions.
With extensive analysis, we identify and remove prevailing noise patterns from
the dataset. We demonstrate the usefulness of CoDesc in two complementary
tasks for code-description pairs: code summarization and code search. We show
that the dataset helps improve code search by up to 22% and achieves the new
state-of-the-art in code summarization. Furthermore, we show CoDesc's
effectiveness in a pre-training and fine-tuning setup, opening up possibilities
for building pretrained language models for Java. To facilitate future
research, we release the dataset, a data processing tool, and a benchmark at
https://github.com/csebuetnlp/CoDesc.
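To make the resource concrete, the sketch below shows how a CoDesc-style code-description pair might be stored and how simple noise filters could be applied when loading it. The JSONL field names (`code`, `nl`) and the filter heuristics are illustrative assumptions made for this summary, not the schema or noise-removal rules of the released data processing tool; see the linked repository for the actual pipeline.

```python
import json
import re

# An illustrative CoDesc-style record: one Java method paired with a natural
# language description. Field names here are assumptions, not the real schema.
EXAMPLE = {
    "code": "public int add(int a, int b) { return a + b; }",
    "nl": "Returns the sum of two integers.",
}

def looks_noisy(description: str) -> bool:
    """Heuristic checks of the kind noise-removal pipelines typically apply.
    These are illustrative guesses, not CoDesc's published filters."""
    if len(description.split()) < 3:        # too short to describe anything
        return True
    if re.search(r"</?\w+>", description):  # leftover HTML/Javadoc markup
        return True
    if not description.isascii():           # crude non-English proxy
        return True
    return False

def load_pairs(path: str):
    """Yield (code, description) pairs from a JSONL file, skipping noisy ones."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not looks_noisy(record["nl"]):
                yield record["code"], record["nl"]

if __name__ == "__main__":
    print(looks_noisy(EXAMPLE["nl"]))  # False: this pair would be kept
```

Pairs that survive such filtering can then feed both directions of the tasks evaluated in the paper: the description serves as the target in code summarization and as the query in code search.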
Related papers
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z) - Constructing Multilingual Code Search Dataset Using Neural Machine Translation [48.32329232202801]
We create a multilingual code search dataset in four natural and four programming languages.
Our results show that the model pre-trained with all natural and programming language data has performed best in most cases.
arXiv Detail & Related papers (2023-06-27T16:42:36Z) - The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation [5.2510537676167335]
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages.
Our evaluations show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet.
arXiv Detail & Related papers (2023-05-09T09:35:03Z) - Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset, CodeClarQA, containing pairs of natural language descriptions and code, along with synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z) - XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence [9.673614921946932]
This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
arXiv Detail & Related papers (2022-06-16T22:49:39Z) - Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
arXiv Detail & Related papers (2021-10-11T17:07:58Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, the proposed model, DGMS, not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z) - Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning [18.354352985591305]
Code summarization generates a brief natural language description for a given source code snippet, while code retrieval fetches relevant source code given a natural language query.
Recent studies have combined these two tasks to improve their performance.
We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
arXiv Detail & Related papers (2020-02-24T12:26:11Z)