Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models
- URL: http://arxiv.org/abs/2501.09745v1
- Date: Thu, 16 Jan 2025 18:55:38 GMT
- Title: Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models
- Authors: Bihui Jin, Jiayue Wang, Pengyu Nie
- Abstract summary: We present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub.
Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning.
- Score: 3.2433570328895196
- Abstract: Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code per repository. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after finetuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
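The authors' extraction pipeline is not reproduced here, but a minimal sketch of the cell-level diffing the abstract describes might look like the following, using `nbformat` and `difflib`. The naive positional pairing of cells is an assumption for illustration; real revisions require proper cell alignment.

```python
import difflib
import nbformat

def cell_level_edits(old_path, new_path):
    """Report per-cell line edits between two revisions of a notebook.

    A minimal sketch of the cell/line-level modifications the dataset
    captures; the paper's actual extraction pipeline may differ.
    """
    old_nb = nbformat.read(old_path, as_version=4)
    new_nb = nbformat.read(new_path, as_version=4)
    old_cells = [c.source.splitlines() for c in old_nb.cells if c.cell_type == "code"]
    new_cells = [c.source.splitlines() for c in new_nb.cells if c.cell_type == "code"]

    edits = []
    # Naively pair cells by position; real revisions need cell alignment.
    for i, (old, new) in enumerate(zip(old_cells, new_cells)):
        diff = list(difflib.unified_diff(old, new, lineterm=""))
        if diff:
            edits.append({"cell": i, "diff": diff})
    return edits
```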
Related papers
- Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates [57.29125360837203]
Cookbook is a framework that generates training data consisting of simple patterns over random tokens.
We find that finetuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points.
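Cookbook's actual templates are defined in the paper; the toy template below (pattern and vocabulary invented here for illustration) shows one thing "simple patterns over random tokens" can mean: the target is the token that follows a marker token.

```python
import random

def make_marker_example(vocab, length=8):
    """Toy data-generating template: the input is random tokens and the
    target is the token following the marker '<m>'. This pattern is
    invented for illustration; Cookbook's real templates differ.
    """
    tokens = [random.choice(vocab) for _ in range(length)]
    pos = random.randrange(length - 1)  # leave room for a token after the marker
    tokens[pos] = "<m>"
    return {"input": " ".join(tokens), "target": tokens[pos + 1]}

vocab = [f"tok{i}" for i in range(100)]
training_data = [make_marker_example(vocab) for _ in range(10000)]
```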
arXiv Detail & Related papers (2024-10-07T17:29:40Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
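A rough sketch of such a feedback loop follows, with `generate(prompt)` as a hypothetical stand-in for the LLM call; the paper's actual prompting uses few-shot demonstrations and richer feedback signals.

```python
import subprocess
import sys
import tempfile

def self_debug(generate, task, max_rounds=3):
    """Run the model's program and feed execution errors back into the
    next prompt. `generate(prompt)` is a hypothetical LLM call.
    """
    prompt = task
    program = ""
    for _ in range(max_rounds):
        program = generate(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
        try:
            result = subprocess.run([sys.executable, f.name],
                                    capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            prompt = f"{task}\n\nPrevious attempt:\n{program}\nError: timed out"
            continue
        if result.returncode == 0:
            return program  # ran cleanly; real setups would also check tests
        # Append the error trace so the model can revise its own code.
        prompt = f"{task}\n\nPrevious attempt:\n{program}\nError:\n{result.stderr}"
    return program
```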
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Static Analysis Driven Enhancements for Comprehension in Machine Learning Notebooks [7.142786325863891]
Jupyter notebooks enable developers to interleave code snippets with rich-text and in-line visualizations.
Recent studies have demonstrated that a large portion of Jupyter notebooks are undocumented and lack a narrative structure.
This paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers.
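HeaderGen performs real static analysis with call-graph resolution; the keyword lookup below is only a crude stand-in showing the shape of the annotation step (the category table is invented for illustration).

```python
import nbformat

# Crude stand-in for HeaderGen's static analysis: map library calls to a
# category label. HeaderGen resolves the call graph; this keyword lookup
# is only an illustration.
CATEGORIES = {"read_csv": "Data Loading", "fit": "Model Training",
              "plot": "Visualization"}

def annotate(path, out_path):
    """Insert a markdown header cell before each code cell whose source
    matches a known category keyword."""
    nb = nbformat.read(path, as_version=4)
    annotated = []
    for cell in nb.cells:
        if cell.cell_type == "code":
            label = next((v for k, v in CATEGORIES.items() if k in cell.source), None)
            if label:
                annotated.append(nbformat.v4.new_markdown_cell(f"## {label}"))
        annotated.append(cell)
    nb.cells = annotated
    nbformat.write(nb, out_path)
```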
arXiv Detail & Related papers (2023-01-11T11:57:52Z) - Natural Language to Code Generation in Interactive Data Science Notebooks [35.621936471322385]
We build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks.
We develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs.
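ARCADE grades predictions by executing them against the notebook state; a strict-equality sketch of that idea follows (ARCADE's actual output matching is fuzzier, and `eval` only handles expression-style snippets).

```python
import pandas as pd

def execution_match(context_df, reference_code, predicted_code):
    """Run the reference and the predicted pandas snippet against the
    same notebook state and compare results. Strict equality is an
    assumption; ARCADE's real matcher is more permissive.
    """
    env_ref = {"df": context_df.copy(), "pd": pd}
    env_pred = {"df": context_df.copy(), "pd": pd}
    try:
        ref = eval(reference_code, env_ref)
        pred = eval(predicted_code, env_pred)
    except Exception:
        return False
    if isinstance(ref, (pd.DataFrame, pd.Series)):
        return type(pred) is type(ref) and ref.equals(pred)
    return ref == pred
```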
arXiv Detail & Related papers (2022-12-19T05:06:00Z) - EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models, finding that InstructGPT and PEER perform best but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z) - Pynblint: a Static Analyzer for Python Jupyter Notebooks [10.190501703364234]
Pynblint is a static analyzer for Jupyter notebooks written in Python.
It checks compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices.
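Two toy checks in the spirit of Pynblint's rules are sketched below (its real, empirically validated rule set is much larger), assuming `nbformat` for parsing.

```python
import nbformat

def lint_notebook(path):
    """Toy best-practice checks: imports belong in the first code cell,
    and the notebook should start with a markdown title. Illustrative
    only; Pynblint's actual rules and reporting differ.
    """
    nb = nbformat.read(path, as_version=4)
    warnings = []
    code_cells = [c for c in nb.cells if c.cell_type == "code"]
    for i, cell in enumerate(code_cells[1:], start=2):
        if any(line.lstrip().startswith(("import ", "from "))
               for line in cell.source.splitlines()):
            warnings.append(f"code cell {i}: import outside the first cell")
    if not (nb.cells and nb.cells[0].cell_type == "markdown"):
        warnings.append("notebook does not start with a markdown title")
    return warnings
```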
arXiv Detail & Related papers (2022-05-24T09:56:03Z) - StickyLand: Breaking the Linear Presentation of Computational Notebooks [5.1175396458764855]
StickyLand is a notebook extension for empowering users to freely organize their code in non-linear ways.
With sticky cells that are always shown on the screen, users can quickly access their notes, instantly observe experiment results, and easily build interactive dashboards.
arXiv Detail & Related papers (2022-02-22T18:25:54Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using those basic engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback on student code for a new programming question from just a few instructor-provided examples.
Our approach was successfully deployed to deliver feedback on 16,000 student exam solutions in a programming course offered by a tier 1 university.
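The core of the prototypical-network idea behind this setup can be sketched in a few lines: average the embeddings of each feedback class's support examples into a prototype, then label queries by the nearest prototype. The encoder producing the embeddings is omitted here.

```python
import numpy as np

def prototype_classify(support_emb, support_labels, query_emb):
    """Nearest-prototype classification. Embeddings would come from a
    trained code encoder; here they are plain (n, d) arrays.
    """
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    # One prototype per feedback class: the mean of its support embeddings.
    protos = np.stack([support_emb[labels == c].mean(axis=0) for c in classes])
    # Euclidean distance from every query to every prototype.
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return [classes[i] for i in dists.argmin(axis=1)]
```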
arXiv Detail & Related papers (2021-07-23T22:41:28Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
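A minimal sketch of the pass-rate metric that APPS-style benchmarks report, assuming solutions are standalone scripts graded on (stdin, expected stdout) pairs:

```python
import subprocess
import sys

def pass_rate(solution_path, test_cases):
    """Fraction of input/output test cases a candidate program passes.
    `test_cases` is a list of (stdin, expected_stdout) pairs; the exact
    grading harness of APPS differs in detail.
    """
    passed = 0
    for stdin, expected in test_cases:
        try:
            result = subprocess.run([sys.executable, solution_path],
                                    input=stdin, capture_output=True,
                                    text=True, timeout=5)
        except subprocess.TimeoutExpired:
            continue  # timeouts count as failures
        if result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(test_cases)
```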
arXiv Detail & Related papers (2021-05-20T17:58:42Z) - ReproduceMeGit: A Visualization Tool for Analyzing Reproducibility of Jupyter Notebooks [0.0]
We present ReproduceMeGit, a visualization tool for analyzing the reproducibility of Jupyter Notebooks hosted on GitHub.
The tool provides information on the number of notebooks that were successfully reproducible, those that resulted in exceptions, those with different results from the original notebooks, etc.
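A single-notebook sketch of the reproducibility check ReproduceMeGit automates at repository scale, using `nbformat` and `nbclient`. The strict output comparison is an assumption; in practice benign differences such as timestamps and execution counts need filtering.

```python
import copy
import nbformat
from nbclient import NotebookClient

def check_reproducibility(path):
    """Re-execute a notebook and compare fresh outputs with the stored
    ones. Returns one of 'exception', 'different results', or
    'reproducible', mirroring the categories the tool reports.
    """
    original = nbformat.read(path, as_version=4)
    rerun = copy.deepcopy(original)
    try:
        NotebookClient(rerun, timeout=60).execute()
    except Exception:
        return "exception"
    for old, new in zip(original.cells, rerun.cells):
        # Strict comparison; real checks should ignore benign diffs.
        if old.cell_type == "code" and old.get("outputs") != new.get("outputs"):
            return "different results"
    return "reproducible"
```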
arXiv Detail & Related papers (2020-06-22T10:05:52Z)