Pynblint: a Static Analyzer for Python Jupyter Notebooks
- URL: http://arxiv.org/abs/2205.11934v1
- Date: Tue, 24 May 2022 09:56:03 GMT
- Title: Pynblint: a Static Analyzer for Python Jupyter Notebooks
- Authors: Luigi Quaranta, Fabio Calefato, Filippo Lanubile
- Abstract summary: Pynblint is a static analyzer for Jupyter notebooks written in Python.
It checks compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices.
- Score: 10.190501703364234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jupyter Notebook is the tool of choice of many data scientists in the early
stages of ML workflows. The notebook format, however, has been criticized for
inducing bad programming practices; indeed, researchers have already shown that
open-source repositories are inundated by poor-quality notebooks. Low-quality
output from the prototypical stages of ML workflows constitutes a clear
bottleneck towards the productization of ML models. To foster the creation of
better notebooks, we developed Pynblint, a static analyzer for Jupyter
notebooks written in Python. The tool checks the compliance of notebooks (and
surrounding repositories) with a set of empirically validated best practices
and provides targeted recommendations when violations are detected.
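To make concrete what statically checking a notebook can mean, the following is a minimal sketch and not Pynblint's actual implementation or API. It assumes a notebook already parsed from its `.ipynb` JSON (nbformat 4 layout) into a dictionary. The function name `lint_notebook` and the three rule identifiers are hypothetical, chosen for illustration only.

```python
import json


def lint_notebook(nb: dict) -> list[str]:
    """Return recommendations for a parsed .ipynb document.

    Hypothetical helper: the rule names below are illustrative assumptions,
    not the rule set shipped with any real tool.
    """
    findings = []
    cells = nb.get("cells", [])
    code_cells = [c for c in cells if c.get("cell_type") == "code"]

    # Rule 1: execution counters should increase from top to bottom.
    counts = [c["execution_count"] for c in code_cells
              if c.get("execution_count") is not None]
    if any(later <= earlier for earlier, later in zip(counts, counts[1:])):
        findings.append("non-linear-execution: re-run the notebook top to bottom")

    # Rule 2: no empty code cells ("source" may be a string or a list of strings).
    if any(not "".join(c.get("source", "")).strip() for c in code_cells):
        findings.append("empty-cell: remove empty code cells")

    # Rule 3: the notebook should open with a Markdown heading.
    first = cells[0] if cells else None
    has_title = (
        first is not None
        and first.get("cell_type") == "markdown"
        and "".join(first.get("source", "")).lstrip().startswith("#")
    )
    if not has_title:
        findings.append("missing-title: start the notebook with a Markdown heading")

    return findings


# A small in-memory notebook that violates all three illustrative rules.
messy = {"cells": [
    {"cell_type": "code", "execution_count": 3, "source": ["import pandas as pd\n"]},
    {"cell_type": "code", "execution_count": 1, "source": ["df = pd.DataFrame()\n"]},
    {"cell_type": "code", "execution_count": None, "source": []},
]}

# A notebook on disk could be loaded with json.load(open(path)) instead.
for finding in lint_notebook(json.loads(json.dumps(messy))):
    print(finding)
```

The checks read only the notebook file and never execute its cells, which is what makes the analysis static.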
Related papers
- Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models [3.2433570328895196]
We present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub.
Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning.
arXiv Detail & Related papers (2025-01-16T18:55:38Z)
- PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings.
PyPulse provides a modular and extensible framework that is easy to use for a broad user base, including non-machine-learning bioresearchers.
We released PyPulse under the MIT License on GitHub and PyPI.
arXiv Detail & Related papers (2024-12-09T11:00:55Z)
- Untangling Knots: Leveraging LLM for Error Resolution in Computational Notebooks [4.318590074766604]
We propose a potential solution for resolving errors in computational notebooks via an iterative LLM-based agent.
We discuss the questions raised by this approach and share a novel dataset of computational notebooks containing bugs.
arXiv Detail & Related papers (2024-03-26T18:53:17Z)
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows [72.40917624485822]
We introduce DataDreamer, an open-source Python library that allows researchers to implement powerful large language model workflows.
DataDreamer also helps researchers adhere to best practices that we propose to encourage open science.
arXiv Detail & Related papers (2024-02-16T00:10:26Z)
- Jup2Kub: algorithms and a system to translate a Jupyter Notebook pipeline to a fault tolerant distributed Kubernetes deployment [0.9790236766474201]
Scientific workflows facilitate computational, data manipulation, and sometimes visualization steps for scientific data analysis.
Jupyter notebooks struggle to scale with larger data sets, lack failure tolerance, and depend heavily on the stability of underlying tools and packages.
Jup2Kub translates Jupyter notebooks into a distributed, high-performance environment, enhancing fault tolerance.
arXiv Detail & Related papers (2023-11-21T02:54:06Z)
- Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models [0.23301643766310373]
We present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects.
Julearn aims to simplify entry into the machine learning world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls.
arXiv Detail & Related papers (2023-10-19T08:21:12Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- PyGOD: A Python Library for Graph Outlier Detection [56.33769221859135]
PyGOD is an open-source library for detecting outliers in graph data.
It supports a wide array of leading graph-based methods for outlier detection.
PyGOD is released under a BSD 2-Clause license at https://pygod.org and on the Python Package Index (PyPI).
arXiv Detail & Related papers (2022-04-26T06:15:21Z)
- FedML: A Research Library and Benchmark for Federated Machine Learning [55.09054608875831]
Federated learning (FL) is a rapidly growing research field in machine learning.
Existing FL libraries cannot adequately support diverse algorithmic development.
We introduce FedML, an open research library and benchmark to facilitate FL algorithm development and fair performance comparison.
arXiv Detail & Related papers (2020-07-27T13:02:08Z)
- ReproduceMeGit: A Visualization Tool for Analyzing Reproducibility of Jupyter Notebooks [0.0]
We present ReproduceMeGit, a visualization tool for analyzing the reproducibility of Jupyter Notebooks in GitHub repositories.
The tool provides information on the number of notebooks that were successfully reproducible, those that resulted in exceptions, those with different results from the original notebooks, etc.
arXiv Detail & Related papers (2020-06-22T10:05:52Z)