Observing Fine-Grained Changes in Jupyter Notebooks During Development Time
- URL: http://arxiv.org/abs/2507.15831v2
- Date: Thu, 24 Jul 2025 22:24:46 GMT
- Title: Observing Fine-Grained Changes in Jupyter Notebooks During Development Time
- Authors: Sergey Titov, Konstantin Grotov, Cristina Sarasua, Yaroslav Golubev, Dhivyabharathi Ramasamy, Alberto Bacchelli, Abraham Bernstein, Timofey Bryksin
- Abstract summary: We introduce a toolset for collecting code changes in Jupyter notebooks during development time. We then use it to collect more than 100 hours of work related to a data analysis task and a machine learning task. In our analysis of the collected data, we classified the changes made to the cells between executions and found that a significant number of these changes were code iteration modifications.
- Score: 12.75622665542759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In software engineering, numerous studies have focused on the analysis of fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for computational notebooks in the context of data science. To help bridge this research gap, we make three scientific contributions: we (1) introduce a toolset for collecting code changes in Jupyter notebooks during development time; (2) use it to collect more than 100 hours of work related to a data analysis task and a machine learning task (carried out by 20 developers with different levels of expertise), resulting in a dataset containing 2,655 cells and 9,207 cell executions; and (3) use this dataset to investigate the dynamic nature of the notebook development process and the changes that take place in the notebooks. In our analysis of the collected data, we classified the changes made to the cells between executions and found that a significant number of these changes were code iteration modifications. We report a number of other insights and propose potential future research directions on the novel data.
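To make the abstract's idea of development-time change collection concrete, below is a minimal sketch, assuming a live IPython/Jupyter session, of how one could hook cell executions and coarsely classify how a cell's source changed between runs. This is not the authors' toolset: the change labels, the 0.6 similarity threshold, and the cell_id handling are illustrative assumptions; only IPython's public events API and the standard-library difflib are taken as given.

```python
# Hypothetical sketch of observing fine-grained cell changes between
# executions in a live IPython kernel -- NOT the paper's actual toolset.
import difflib
from IPython import get_ipython

_last_source = {}  # last executed source, keyed by cell id

def classify_change(old, new):
    """Coarsely classify the change between two executions of a cell."""
    if old is None:
        return "new cell"
    if old == new:
        return "re-execution"  # same code run again, unchanged
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    # 0.6 is an arbitrary illustrative threshold, not from the paper
    return "iteration" if ratio > 0.6 else "rewrite"

def on_pre_run_cell(info):
    # info.raw_cell is the source about to run (IPython events API);
    # cell_id is only populated by newer ipykernel versions (assumption)
    cell_id = getattr(info, "cell_id", None) or "default"
    print("change kind:", classify_change(_last_source.get(cell_id), info.raw_cell))
    _last_source[cell_id] = info.raw_cell

ip = get_ipython()
if ip is not None:  # only active inside an IPython/Jupyter session
    ip.events.register("pre_run_cell", on_pre_run_cell)
```

Running this once in a notebook registers the hook; each subsequent execution then prints a coarse change kind. A real collector along the paper's lines would persist timestamped snapshots of cell sources rather than print labels.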
Related papers
- A Systematic Literature Review of Software Engineering Research on Jupyter Notebook [8.539234346904905]
The purpose of this study is to analyze trends, gaps, and methodologies used in software engineering research on Jupyter notebooks. The most popular venues for publishing software engineering research on Jupyter notebooks are related to human-computer interaction.
arXiv Detail & Related papers (2025-04-22T18:12:04Z)
- Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models [3.2433570328895196]
We present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning.
arXiv Detail & Related papers (2025-01-16T18:55:38Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- Untangling Knots: Leveraging LLM for Error Resolution in Computational Notebooks [4.318590074766604]
We propose a potential solution for resolving errors in computational notebooks via an iterative LLM-based agent.
We discuss the questions raised by this approach and share a novel dataset of computational notebooks containing bugs.
arXiv Detail & Related papers (2024-03-26T18:53:17Z)
- A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence. It covers over 50 representative models and their variants, more than 20 categories of tasks, and extensive coverage of over 680 related works. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z)
- Investigating Reproducibility in Deep Learning-Based Software Fault Prediction [16.25827159504845]
With the rapid adoption of increasingly complex machine learning models, it becomes more and more difficult for scholars to reproduce the results that are reported in the literature.
This is particularly the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared. We conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles published between 2019 and 2022 in top-tier software engineering conferences.
arXiv Detail & Related papers (2024-02-08T13:00:18Z)
- CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code [6.491009626125319]
We introduce CodeLL, a lifelong learning dataset focused on code changes.
Our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories.
CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes.
arXiv Detail & Related papers (2023-12-20T01:20:24Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z)
- Mining the Characteristics of Jupyter Notebooks in Data Science Projects [1.655246222110267]
The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. This research aims to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and popular Jupyter Notebooks for data science projects on GitHub.
arXiv Detail & Related papers (2023-04-11T16:30:53Z)
- Natural Language to Code Generation in Interactive Data Science Notebooks [35.621936471322385]
We build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks.
We develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs.
arXiv Detail & Related papers (2022-12-19T05:06:00Z)
- Time-Varying Propensity Score to Bridge the Gap between the Past and Present [104.46387765330142]
We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data.
We demonstrate different ways of implementing it and evaluate it on a variety of problems.
arXiv Detail & Related papers (2022-10-04T07:21:49Z)
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)