CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code
- URL: http://arxiv.org/abs/2312.12492v1
- Date: Wed, 20 Dec 2023 01:20:24 GMT
- Title: CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code
- Authors: Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, and Houari Sahraoui
- Abstract summary: We introduce CodeLL, a lifelong learning dataset focused on code changes.
Our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories.
CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes.
- Score: 6.491009626125319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by recent work on lifelong learning applications for language
models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused
on code changes. Our contribution addresses a notable research gap marked by
the absence of a long-term temporal dimension in existing code change datasets,
limiting their suitability in lifelong learning scenarios. In contrast, our
dataset aims to comprehensively capture code changes across the entire release
history of open-source software repositories. In this work, we introduce an
initial version of CodeLL, comprising 71 machine-learning-based projects mined
from Software Heritage. This dataset enables the extraction and in-depth
analysis of code changes spanning 2,483 releases at both the method and API
levels. CodeLL enables researchers to study the behaviour of LMs in lifelong
fine-tuning settings for learning code changes. Additionally, the dataset can
help in studying data distribution shifts within software repositories and the
evolution of API usage over time.
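The abstract does not reproduce CodeLL's schema, but the kind of analysis it targets, tracking API usage across a repository's release history, can be sketched roughly as follows. The release tags, code snippets, and token-based API-name extraction below are all hypothetical illustrations, not CodeLL's actual data format or mining tooling.

```python
from collections import Counter

# Hypothetical release-ordered snapshots of method bodies in one repository.
# This only mimics the idea of method-level changes across releases.
releases = [
    ("v1.0", ["np.dot(a, b)", "np.dot(x, y)"]),
    ("v2.0", ["np.dot(a, b)", "a @ b"]),
    ("v3.0", ["a @ b", "x @ y", "np.matmul(a, b)"]),
]

def api_usage_by_release(releases):
    """Count call-site names per release to expose distribution shift."""
    usage = {}
    for tag, snippets in releases:
        counts = Counter()
        for snippet in snippets:
            # Crude API-name extraction: everything before the first "(".
            counts[snippet.split("(")[0].strip()] += 1
        usage[tag] = counts
    return usage

usage = api_usage_by_release(releases)
# np.dot dominates v1.0 and is gone by v3.0: the kind of temporal
# distribution shift a lifelong fine-tuning setup has to track.
```

Under these assumptions, a model fine-tuned on v1.0 alone would over-represent `np.dot`; release-ordered fine-tuning, which CodeLL is built to support, instead exposes the model to the shift as it happens.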
Related papers
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
We introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.
We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE).
Comprehensive experiments benchmark the performance of LLMs, revealing the challenging nature of both tasks and of VersiCode itself.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks.
This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS (Curriculum of Code Completion Subtasks) addresses the exploration challenge by breaking the long-sequence code generation task into a curriculum of code completion subtasks.
FGO (Fine-Grained Optimization) optimizes the model only on executed code by masking the unexecuted segments.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the code generation performance of the model.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- InstructCoder: Instruction Tuning Large Language Models for Code Editing [26.160498475809266]
We explore the use of Large Language Models (LLMs) to edit code based on user instructions.
InstructCoder is the first instruction-tuning dataset designed to adapt LLMs for general-purpose code editing.
Our findings reveal that open-source LLMs fine-tuned on InstructCoder can significantly enhance the accuracy of code edits.
arXiv Detail & Related papers (2023-10-31T10:15:35Z)
- TASTY: A Transformer based Approach to Space and Time complexity [0.4724825031148411]
Code-based Language Models (LMs) have shown very promising results in the field of software engineering.
We create a labelled dataset of code snippets spanning multiple languages.
We propose using LMs to find space complexities from code; to the best of our knowledge, this is the first attempt to do so.
arXiv Detail & Related papers (2023-05-06T03:37:44Z)
- XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence [9.673614921946932]
This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
arXiv Detail & Related papers (2022-06-16T22:49:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.