CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
- URL: http://arxiv.org/abs/2409.16819v1
- Date: Wed, 25 Sep 2024 11:18:52 GMT
- Title: CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
- Authors: Nathanaël Beau, Benoît Crabbé
- Abstract summary: The dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests.
Comprising 3,409 examples crafted by Python experts, the dataset is designed for both model finetuning and standalone evaluation.
- Score: 10.19019476978683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries, in Python code derived from Stack Overflow. Comprising 3,409 examples crafted by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complement unit-test evaluation, we categorize the examples to enable more fine-grained analysis, enhancing the understanding of models' strengths and weaknesses on specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and StarCoder 15B. We further investigate data contamination by testing GPT-4's performance on a part of our dataset. The benchmark can be accessed at https://github.com/NathanaelBeau/CodeInsight.
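To make the structure concrete, below is a minimal sketch of what a single example and its unit-test check might look like; the field names and the sample problem are hypothetical illustrations, not the dataset's actual schema.

```python
# Hypothetical illustration of one dataset entry: a clarified intent,
# an associated code snippet, and a few unit tests (field names assumed).
example = {
    "intent": "Sort a list of dicts by the 'age' key in descending order",
    "snippet": "def solve(people):\n    return sorted(people, key=lambda p: p['age'], reverse=True)",
    "tests": [
        "assert solve([{'age': 1}, {'age': 3}]) == [{'age': 3}, {'age': 1}]",
        "assert solve([]) == []",
        "assert solve([{'age': 2}]) == [{'age': 2}]",
    ],
}

def passes_all_tests(candidate_code: str, tests: list[str]) -> bool:
    """Execute a candidate solution, then run every unit test against it."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines solve()
        for test in tests:
            exec(test, namespace)        # assertion raises on failure
    except Exception:
        return False
    return True

print(passes_all_tests(example["snippet"], example["tests"]))  # True
```

Scoring a model then reduces to counting how many generated snippets pass all unit tests of their example.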
Related papers
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents work on using large language models for question answering over source code in Python.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
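Of these metrics, Exact Match is the simplest to reproduce. A minimal sketch follows; the whitespace normalization is an assumed choice, not necessarily the paper's.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after
    whitespace normalization (the normalization is an assumed choice)."""
    def norm(code: str) -> str:
        return " ".join(code.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

print(exact_match(["x = 1 "], ["x = 1"]))  # 1.0
```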
arXiv Detail & Related papers (2024-11-05T11:25:12Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
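As a rough illustration of what incorporating data context can mean, the sketch below prepends prior notebook cells and a dataframe preview to the task description before generation; the prompt layout and the helper function are assumptions, since the summary does not specify CoCoNote's format.

```python
import pandas as pd

def build_prompt(prior_cells: list[str], df: pd.DataFrame, task: str) -> str:
    """Assemble a context-augmented prompt for data-wrangling code generation.
    The layout (code context + schema preview + task) is an assumed format."""
    schema = ", ".join(f"{c}: {t}" for c, t in df.dtypes.astype(str).items())
    return (
        "# Notebook context:\n" + "\n".join(prior_cells)
        + f"\n# DataFrame columns: {schema}"
        + f"\n# Head:\n{df.head(3).to_string()}"
        + f"\n# Task: {task}\n"
    )

df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.0]})
print(build_prompt(["df = pd.read_csv('scores.csv')"], df, "Keep rows with score > 1"))
```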
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data [26.836532205017104]
We find that many datasets suffer from severe data leakage.
This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data.
We present XCoder, a family of models finetuned from LLaMA3.
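The summary does not say how leakage is detected; one common heuristic, sketched here as an assumption rather than XCoder's actual method, flags a training example whose token n-grams overlap heavily with a benchmark item.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_text: str, test_text: str, n: int = 8) -> float:
    """Share of the benchmark item's n-grams that also occur in the training item."""
    test_grams = ngrams(test_text.split(), n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(train_text.split(), n)
    return len(test_grams & train_grams) / len(test_grams)

# Flag likely leakage above some threshold, e.g. 60% of benchmark n-grams
# found in a training example (n=2 here only because the demo strings are short).
snippet = "return sorted(xs, reverse=True)"
print(overlap_ratio(snippet, snippet, n=2) > 0.6)  # True
```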
arXiv Detail & Related papers (2024-09-05T17:46:30Z)
- On Leakage of Code Generation Evaluation Datasets [44.4726918027046]
We examine contamination by code generation test sets, in particular their use in modern large language models.
To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions.
arXiv Detail & Related papers (2024-07-10T11:50:20Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
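The summary does not name CoIR's metrics; as an assumed but typical choice for retrieval benchmarks, the sketch below computes Mean Reciprocal Rank over ranked candidate lists.

```python
def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """MRR: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids) if gold_ids else 0.0

# Query 1: gold at rank 2 (0.5); query 2: gold at rank 1 (1.0) -> MRR 0.75
print(mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["b", "c"]))
```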
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
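The adversarial algorithm is not detailed in this summary; one standard option for robustness to distribution shift is DANN-style training with a gradient reversal layer, sketched below as an assumed illustration rather than the paper's actual method.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder learns features the domain classifier cannot separate."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = torch.nn.Linear(16, 8)
domain_head = torch.nn.Linear(8, 2)   # predicts source vs. target domain

x = torch.randn(4, 16)
features = encoder(x)
domain_logits = domain_head(GradReverse.apply(features, 1.0))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.tensor([0, 0, 1, 1]))
loss.backward()  # encoder gradients are reversed w.r.t. the domain loss
```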
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Execution-based Evaluation for Data Science Code Generation Models [97.96608263010913]
We introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
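A bare-bones version of execution-based scoring in this spirit runs each generated program and compares its stdout to the desired output; the subprocess setup and the trimmed-string comparison below are simplifying assumptions.

```python
import subprocess
import sys

def output_matches(program: str, expected_output: str, timeout_s: int = 10) -> bool:
    """Run a candidate program in a subprocess and compare trimmed stdout
    to the reference output (a simplified, assumed comparison rule)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()

print(output_matches("print(2 + 2)", "4"))  # True
```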
arXiv Detail & Related papers (2022-11-17T07:04:11Z)
- On the Importance of Building High-quality Training Datasets for Neural Code Search [15.557818317497397]
We propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter and a model-based semantic filter.
We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks.
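A rule-based syntactic filter of this kind can be approximated in a few lines; the specific rules below (must parse, must not be an obvious stub) are illustrative assumptions, not the paper's rule set.

```python
import ast

def passes_syntactic_filter(code: str) -> bool:
    """Keep a snippet only if it parses and looks like real, non-stub code."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    if not tree.body:  # empty file
        return False
    # Reject obvious stubs such as a function whose body is just `pass`.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                return False
    return True

print(passes_syntactic_filter("def f():\n    pass"))      # False (stub)
print(passes_syntactic_filter("def f():\n    return 1"))  # True
```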
arXiv Detail & Related papers (2022-02-14T12:02:41Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
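For reference, ROUGE-L scores the longest common subsequence (LCS) between candidate and reference; a minimal F-measure implementation, tokenizing on whitespace as an assumed simplification, looks like this.

```python
def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(round(100 * rouge_l_f("the cat sat", "the cat sat down"), 1))  # 85.7
```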
arXiv Detail & Related papers (2020-10-05T19:59:05Z)