A Flexible Cell Classification for ML Projects in Jupyter Notebooks
- URL: http://arxiv.org/abs/2403.07562v1
- Date: Tue, 12 Mar 2024 11:50:47 GMT
- Title: A Flexible Cell Classification for ML Projects in Jupyter Notebooks
- Authors: Miguel Perez and Selin Aydin and Horst Lichter
- Abstract summary: This paper presents a more flexible approach to cell classification based on a hybrid classification approach that combines a rule-based and a decision tree classifier.
We implemented the new flexible cell classification approach in a tool called JupyLabel.
- Score: 0.21485350418225244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jupyter Notebook is an interactive development environment commonly used for
rapid experimentation of machine learning (ML) solutions. Describing the ML
activities performed along code cells improves the readability and
understanding of Notebooks. Manual annotation of code cells is time-consuming
and error-prone. Therefore, tools have been developed that classify the cells
of a notebook concerning the ML activity performed in them. However, the
current tools are not flexible, as they rely on predefined look-up tables
that map function calls of commonly used ML libraries to ML
activities. These tables must be manually adjusted to account for new or
changed libraries.
This paper presents a more flexible approach to cell classification based on
a hybrid classification approach that combines a rule-based and a decision tree
classifier. We discuss the design rationales and describe the developed
classifiers in detail. We implemented the new flexible cell classification
approach in a tool called JupyLabel. Its evaluation and the obtained metric
scores regarding precision, recall, and F1-score are discussed. Additionally,
we compared JupyLabel with HeaderGen, an existing cell classification tool. We
were able to show that the presented flexible cell classification approach
outperforms this tool significantly.
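
To make the hybrid idea in the abstract concrete, here is a minimal sketch, not JupyLabel's actual implementation. It assumes a rule-based stage that matches known function calls and, when no rule fires, a fallback stage that decides from simple cell features. The rule table, the label names, the features, and the hand-written `tree_fallback` (a stand-in for a trained decision tree) are all hypothetical and chosen only for illustration. A small helper computes per-label precision, recall, and F1-score, the metrics named in the abstract.

```python
import re
from typing import Optional

# Hypothetical rule table: regex over the cell source -> ML activity label.
# These patterns and labels are illustrative, not taken from the paper.
RULES = [
    (re.compile(r"\bread_csv\s*\("), "data-ingestion"),
    (re.compile(r"\.fit\s*\("), "model-training"),
    (re.compile(r"\bplt\.\w+\s*\("), "visualization"),
]


def rule_based(source: str) -> Optional[str]:
    """Return the label of the first matching rule, or None if none match."""
    for pattern, label in RULES:
        if pattern.search(source):
            return label
    return None


def features(source: str) -> dict:
    """Extract a few crude, hypothetical features from a code cell."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "n_lines": len(lines),
        "n_imports": sum(ln.lstrip().startswith(("import ", "from ")) for ln in lines),
        # Crude check for "=" that is not part of ==, !=, <= or >=.
        "has_assignment": any(re.search(r"[^=!<>]=[^=]", ln) for ln in lines),
    }


def tree_fallback(f: dict) -> str:
    """Hand-written stand-in for a trained decision tree (assumption)."""
    if f["n_lines"] > 0 and f["n_imports"] == f["n_lines"]:
        return "setup"
    if f["has_assignment"]:
        return "data-preparation"
    return "other"


def classify(source: str) -> str:
    """Hybrid classification: rules first, feature-based fallback second."""
    return rule_based(source) or tree_fallback(features(source))


def precision_recall_f1(y_true, y_pred, label):
    """Per-label precision, recall, and F1 from two parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In this sketch, supporting a new library would mean adding a rule or retraining the fallback stage rather than editing one large look-up table. In practice the fallback would be a decision tree fitted on labeled cells rather than fixed thresholds.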
Related papers
- Typhon: Automatic Recommendation of Relevant Code Cells in Jupyter Notebooks [0.3122672716129843]
This paper proposes Typhon, an approach to automatically recommend relevant code cells in Jupyter notebooks.
Typhon tokenizes developers' markdown description cells and looks for the most similar code cells from the database.
We evaluated the Typhon tool on Jupyter notebooks from Kaggle competitions and found that the approach can recommend code cells with moderate accuracy.
arXiv Detail & Related papers (2024-05-15T03:59:59Z)
- RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning [75.61681328968714]
We propose recurrent independent Grid LSTM (RigLSTM) to exploit the underlying modular structure of the target task.
Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve a better generalization ability.
arXiv Detail & Related papers (2023-11-03T07:40:06Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Static Analysis Driven Enhancements for Comprehension in Machine Learning Notebooks [7.142786325863891]
Jupyter notebooks enable developers to interleave code snippets with rich-text and in-line visualizations.
Recent studies have demonstrated that a large portion of Jupyter notebooks are undocumented and lack a narrative structure.
This paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers.
arXiv Detail & Related papers (2023-01-11T11:57:52Z)
- End-to-End Learning to Index and Search in Large Output Spaces [95.16066833532396]
Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
arXiv Detail & Related papers (2022-10-16T01:34:17Z)
- Class-Incremental Lifelong Learning in Multi-Label Classification [3.711485819097916]
This paper studies Lifelong Multi-Label (LML) classification, which builds an online class-incremental classifier in a sequential multi-label classification data stream.
To solve the problem, the study proposes an Augmented Graph Convolutional Network (AGCN) with a built Augmented Correlation Matrix (ACM) across sequential partial-label tasks.
arXiv Detail & Related papers (2022-07-16T05:14:07Z)
- Few-Shot Class-Incremental Learning by Sampling Multi-Phase Tasks [59.12108527904171]
A model should recognize new classes and maintain discriminability over old classes.
The task of recognizing few-shot new classes without forgetting old classes is called few-shot class-incremental learning (FSCIL).
We propose a new paradigm for FSCIL based on meta-learning by LearnIng Multi-phase Incremental Tasks (LIMIT).
arXiv Detail & Related papers (2022-03-31T13:46:41Z)
- Context-aware Execution Migration Tool for Data Science Jupyter Notebooks on Hybrid Clouds [0.22908242575265025]
This paper presents a solution developed as a Jupyter extension that automatically selects which cells, as well as in which scenarios, such cells should be migrated to a more suitable platform for execution.
Using notebooks from Earth science (remote sensing), image recognition, and handwritten digit identification (machine learning), our experiments show notebook state reductions of up to 55x and migration decisions leading to performance gains of up to 3.25x when user interactivity with the notebook is taken into consideration.
arXiv Detail & Related papers (2021-07-01T02:33:18Z)
- SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
We then train a student model on both labels and pseudo labels to generate final feature embeddings.
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
- Split and Expand: An inference-time improvement for Weakly Supervised Cell Instance Segmentation [71.50526869670716]
We propose a two-step post-processing procedure, Split and Expand, to improve the conversion of segmentation maps to instances.
In the Split step, we split clumps of cells from the segmentation map into individual cell instances with the guidance of cell-center predictions.
In the Expand step, we find missing small cells using the cell-center predictions.
arXiv Detail & Related papers (2020-07-21T14:05:09Z)
- Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution [1.5224436211478214]
This paper describes a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines.
The experimental results include comparing AutoML-DSGE to another grammar-based AutoML framework, Resilient Classification Pipeline Evolution (RECIPE).
arXiv Detail & Related papers (2020-04-01T09:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.