CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
- URL: http://arxiv.org/abs/2409.16819v1
- Date: Wed, 25 Sep 2024 11:18:52 GMT
- Title: CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
- Authors: Nathanaël Beau, Benoît Crabbé
- Abstract summary: The dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests.
Comprising 3,409 examples crafted by Python experts, the dataset is designed for both model finetuning and standalone evaluation.
- Score: 10.19019476978683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries, in Python code derived from Stack Overflow. Comprising 3,409 examples crafted by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complement unit-test evaluation, we categorize the examples to enable more fine-grained analysis, enhancing the understanding of models' strengths and weaknesses on specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and StarCoder 15B. We further investigate data contamination by testing GPT-4's performance on a part of our dataset. The benchmark can be accessed at https://github.com/NathanaelBeau/CodeInsight.
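To make the structure concrete, below is a minimal sketch of what a single example and its unit-test check might look like; the field names and the sample problem are hypothetical illustrations, not the dataset's actual schema.

```python
# Hypothetical illustration of one dataset entry: a clarified intent,
# an associated code snippet, and a few unit tests (field names assumed).
example = {
    "intent": "Sort a list of dicts by the 'age' key in descending order",
    "snippet": "def solve(people):\n    return sorted(people, key=lambda p: p['age'], reverse=True)",
    "tests": [
        "assert solve([{'age': 1}, {'age': 3}]) == [{'age': 3}, {'age': 1}]",
        "assert solve([]) == []",
        "assert solve([{'age': 2}]) == [{'age': 2}]",
    ],
}

def passes_all_tests(candidate_code: str, tests: list[str]) -> bool:
    """Execute a candidate solution, then run every unit test against it."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines solve()
        for test in tests:
            exec(test, namespace)        # assertion raises on failure
    except Exception:
        return False
    return True

print(passes_all_tests(example["snippet"], example["tests"]))  # True
```

Scoring a model then reduces to counting how many generated snippets pass all unit tests of their example.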
Related papers
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents work on using large language models for question answering over source code in Python.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
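Of these metrics, Exact Match is the simplest to reproduce. A minimal sketch follows; the whitespace normalization is an assumed choice, not necessarily the paper's.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after
    whitespace normalization (the normalization is an assumed choice)."""
    def norm(code: str) -> str:
        return " ".join(code.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

print(exact_match(["x = 1 "], ["x = 1"]))  # 1.0
```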
arXiv Detail & Related papers (2024-11-05T11:25:12Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
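As a rough illustration of what incorporating data context can mean, the sketch below prepends prior notebook cells and a dataframe preview to the task description before generation; the prompt layout and the helper function are assumptions, since the summary does not specify CoCoNote's format.

```python
import pandas as pd

def build_prompt(prior_cells: list[str], df: pd.DataFrame, task: str) -> str:
    """Assemble a context-augmented prompt for data-wrangling code generation.
    The layout (code context + schema preview + task) is an assumed format."""
    schema = ", ".join(f"{c}: {t}" for c, t in df.dtypes.astype(str).items())
    return (
        "# Notebook context:\n" + "\n".join(prior_cells)
        + f"\n# DataFrame columns: {schema}"
        + f"\n# Head:\n{df.head(3).to_string()}"
        + f"\n# Task: {task}\n"
    )

df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.0]})
print(build_prompt(["df = pd.read_csv('scores.csv')"], df, "Keep rows with score > 1"))
```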
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data [26.836532205017104]
We find that many datasets suffer from severe data leakage.
This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data.
We present XCoder, a family of models finetuned from LLaMA3.
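The summary does not say how leakage is detected; one common heuristic, sketched here as an assumption rather than XCoder's actual method, flags a training example whose token n-grams overlap heavily with a benchmark item.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_text: str, test_text: str, n: int = 8) -> float:
    """Share of the benchmark item's n-grams that also occur in the training item."""
    test_grams = ngrams(test_text.split(), n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(train_text.split(), n)
    return len(test_grams & train_grams) / len(test_grams)

# Flag likely leakage above some threshold, e.g. 60% of benchmark n-grams
# found in a training example (n=2 here only because the demo strings are short).
snippet = "return sorted(xs, reverse=True)"
print(overlap_ratio(snippet, snippet, n=2) > 0.6)  # True
```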
arXiv Detail & Related papers (2024-09-05T17:46:30Z)
- On Leakage of Code Generation Evaluation Datasets [44.4726918027046]
We examine contamination by code generation test sets, in particular their use in modern large language models.
To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions.
arXiv Detail & Related papers (2024-07-10T11:50:20Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
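The summary does not name CoIR's metrics; as an assumed but typical choice for retrieval benchmarks, the sketch below computes Mean Reciprocal Rank over ranked candidate lists.

```python
def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """MRR: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids) if gold_ids else 0.0

# Query 1: gold at rank 2 (0.5); query 2: gold at rank 1 (1.0) -> MRR 0.75
print(mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["b", "c"]))
```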
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
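The adversarial algorithm is not detailed in this summary; one standard option for robustness to distribution shift is DANN-style training with a gradient reversal layer, sketched below as an assumed illustration rather than the paper's actual method.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder learns features the domain classifier cannot separate."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = torch.nn.Linear(16, 8)
domain_head = torch.nn.Linear(8, 2)   # predicts source vs. target domain

x = torch.randn(4, 16)
features = encoder(x)
domain_logits = domain_head(GradReverse.apply(features, 1.0))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.tensor([0, 0, 1, 1]))
loss.backward()  # encoder gradients are reversed w.r.t. the domain loss
```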
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Execution-based Evaluation for Data Science Code Generation Models [97.96608263010913]
We introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
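A bare-bones version of execution-based scoring in this spirit runs each generated program and compares its stdout to the desired output; the subprocess setup and the trimmed-string comparison below are simplifying assumptions.

```python
import subprocess
import sys

def output_matches(program: str, expected_output: str, timeout_s: int = 10) -> bool:
    """Run a candidate program in a subprocess and compare trimmed stdout
    to the reference output (a simplified, assumed comparison rule)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()

print(output_matches("print(2 + 2)", "4"))  # True
```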
arXiv Detail & Related papers (2022-11-17T07:04:11Z)
- On the Importance of Building High-quality Training Datasets for Neural Code Search [15.557818317497397]
We propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter and a model-based semantic filter.
We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks.
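A rule-based syntactic filter of this kind can be approximated in a few lines; the specific rules below (must parse, must not be an obvious stub) are illustrative assumptions, not the paper's rule set.

```python
import ast

def passes_syntactic_filter(code: str) -> bool:
    """Keep a snippet only if it parses and looks like real, non-stub code."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    if not tree.body:  # empty file
        return False
    # Reject obvious stubs such as a function whose body is just `pass`.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                return False
    return True

print(passes_syntactic_filter("def f():\n    pass"))      # False (stub)
print(passes_syntactic_filter("def f():\n    return 1"))  # True
```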
arXiv Detail & Related papers (2022-02-14T12:02:41Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
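For reference, ROUGE-L scores the longest common subsequence (LCS) between candidate and reference; a minimal F-measure implementation, tokenizing on whitespace as an assumed simplification, looks like this.

```python
def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(round(100 * rouge_l_f("the cat sat", "the cat sat down"), 1))  # 85.7
```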
arXiv Detail & Related papers (2020-10-05T19:59:05Z)