Mining the Characteristics of Jupyter Notebooks in Data Science Projects
- URL: http://arxiv.org/abs/2304.05325v2
- Date: Sat, 26 Apr 2025 07:31:56 GMT
- Title: Mining the Characteristics of Jupyter Notebooks in Data Science Projects
- Authors: Morakot Choetkiertikul, Apirak Hoonlor, Chaiyong Ragkhitwetsagul, Siripen Pongpaichet, Thanwadee Sunetnanta, Tasha Settewong, Vacharavich Jiravatvanich, Urisayar Kaewpichai, Raula Gaikovina Kula
- Abstract summary: The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. This research aims to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and the popular Jupyter Notebooks for data science projects on GitHub.
- Score: 1.655246222110267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, numerous industries have exceptional demand for skills in data science, such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. Kaggle and GitHub are two platforms where data science communities are used for knowledge-sharing, skill-practicing, and collaboration. While tutorials and guidelines for novice data science are available on both platforms, there is a low number of Jupyter Notebooks that received high numbers of votes from the community. The high-voted notebook is considered well-documented, easy to understand, and applies the best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and the popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importances to understand the overall structure of these notebooks and to identify common patterns and best-practice features separating the low-voted and high-voted notebooks. Upon the completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners looking to improve their performance from novice ranking Jupyter Notebook on Kaggle to a deployable project on GitHub.
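The abstract describes mining notebooks and comparing structural features of low-voted and high-voted ones, but does not list the features used. As a minimal sketch only, assuming the notebooks are stored in the standard `.ipynb` JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`), per-notebook features could be extracted as below. The function names and the chosen features (cell counts, code lines, markdown words) are hypothetical illustrations, not the authors' method.

```python
import json


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON, into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cell_text(cell: dict) -> str:
    """Return a cell's source as one string (it may be a str or a list of str)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def notebook_features(notebook: dict) -> dict:
    """Compute simple structural features (hypothetical choice) for one notebook."""
    cells = notebook.get("cells", [])
    code = [c for c in cells if c.get("cell_type") == "code"]
    markdown = [c for c in cells if c.get("cell_type") == "markdown"]
    return {
        "n_cells": len(cells),
        "n_code_cells": len(code),
        "n_markdown_cells": len(markdown),
        # Total lines of code across all code cells.
        "code_lines": sum(len(cell_text(c).splitlines()) for c in code),
        # Rough documentation measure: whitespace-separated words in markdown.
        "markdown_words": sum(len(cell_text(c).split()) for c in markdown),
        # Share of cells that are markdown; 0.0 for an empty notebook.
        "markdown_ratio": len(markdown) / len(cells) if cells else 0.0,
    }


if __name__ == "__main__":
    # Tiny in-memory notebook used only for demonstration.
    sample = {
        "cells": [
            {"cell_type": "markdown", "source": "# Title\nSome intro text"},
            {"cell_type": "code", "source": ["import math\n", "x = math.pi"]},
            {"cell_type": "code", "source": "print(1)"},
        ]
    }
    print(notebook_features(sample))
```

Feature dictionaries like this one could then be collected into a table, joined with vote counts, and passed to whatever exploratory analysis or feature-importance method is chosen.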
Related papers
- A Systematic Literature Review of Software Engineering Research on Jupyter Notebook [8.539234346904905]
The purpose of this study is to analyze trends, gaps, and methodologies used in software engineering research on Jupyter notebooks. The most popular venues for publishing software engineering research on Jupyter notebooks are related to human-computer interaction.
arXiv Detail & Related papers (2025-04-22T18:12:04Z)
- Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models [3.2433570328895196]
We present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning.
arXiv Detail & Related papers (2025-01-16T18:55:38Z)
- Exploring Text-to-Motion Generation with Human Preference [59.28730218998923]
This paper presents an exploration of preference learning in text-to-motion generation.
We find that current improvements in text-to-motion generation still rely on datasets requiring expert labelers with motion capture systems.
We show that preference learning has the potential to greatly improve current text-to-motion generative models.
arXiv Detail & Related papers (2024-04-15T04:14:42Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Notably Inaccessible -- Data Driven Understanding of Data Science Notebook (In)Accessibility [13.428631054625797]
We perform a large scale systematic analysis of 100000 Jupyter notebooks to identify various accessibility challenges.
We make recommendations to improve accessibility of the artifacts of a notebook, suggest authoring practices, and propose changes to infrastructure to make notebooks accessible.
arXiv Detail & Related papers (2023-08-07T01:33:32Z)
- The Semantic Scholar Open Data Platform [92.2948743167744]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- StickyLand: Breaking the Linear Presentation of Computational Notebooks [5.1175396458764855]
StickyLand is a notebook extension for empowering users to freely organize their code in non-linear ways.
With sticky cells that are always shown on the screen, users can quickly access their notes, instantly observe experiment results, and easily build interactive dashboards.
arXiv Detail & Related papers (2022-02-22T18:25:54Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- GIS and Computational Notebooks [0.0]
This chapter introduces computational notebooks in the geographical context.
It begins by explaining the computational paradigm and philosophy that underlies notebooks.
It then unpacks their architecture to illustrate a notebook user's typical workflow.
arXiv Detail & Related papers (2021-01-02T01:59:14Z)
- Scaling Systematic Literature Reviews with Machine Learning Pipelines [57.82662094602138]
Systematic reviews entail the extraction of data from scientific documents.
We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs.
We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation.
arXiv Detail & Related papers (2020-10-09T16:19:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.