A Flexible Cell Classification for ML Projects in Jupyter Notebooks
- URL: http://arxiv.org/abs/2403.07562v1
- Date: Tue, 12 Mar 2024 11:50:47 GMT
- Title: A Flexible Cell Classification for ML Projects in Jupyter Notebooks
- Authors: Miguel Perez and Selin Aydin and Horst Lichter
- Abstract summary: This paper presents a more flexible approach to cell classification, a hybrid that combines a rule-based and a decision tree classifier.
We implemented the new flexible cell classification approach in a tool called JupyLabel.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jupyter Notebook is an interactive development environment commonly used for
rapid experimentation of machine learning (ML) solutions. Describing the ML
activities performed along code cells improves the readability and
understanding of Notebooks. Manual annotation of code cells is time-consuming
and error-prone. Therefore, tools have been developed that classify the cells
of a notebook according to the ML activity performed in them. However, current
tools are inflexible: they rely on manually created look-up tables that map
function calls of commonly used ML libraries to ML activities. These tables
must be adjusted by hand to account for new or changed libraries.
This paper presents a more flexible approach to cell classification: a hybrid
that combines a rule-based and a decision tree classifier. We discuss the
design rationales and describe the developed classifiers in detail. We
implemented the new approach in a tool called JupyLabel and discuss its
evaluation, reporting precision, recall, and F1-score. Additionally, we
compared JupyLabel with HeaderGen, an existing cell classification tool, and
show that the presented flexible cell classification approach significantly
outperforms it.
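The hybrid idea, a rule-based pass backed by a learned decision tree fallback, can be sketched as follows. The rules, activity labels, and features below are illustrative assumptions for the sketch, not JupyLabel's actual ones.

```python
# Hypothetical sketch of a hybrid cell classifier: a rule-based pass handles
# unambiguous cells, and a decision tree trained on simple lexical features
# classifies the rest. Rules, labels, and training data are illustrative only.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

RULES = [  # (regex on the cell source, ML activity label) -- assumed rules
    (re.compile(r"\bread_csv\(|\bload_dataset\("), "data_loading"),
    (re.compile(r"\.fit\("), "training"),
    (re.compile(r"\.predict\(|\bscore\("), "evaluation"),
]

def rule_based(cell: str):
    """Return a label if any rule fires, else None to fall through."""
    for pattern, label in RULES:
        if pattern.search(cell):
            return label
    return None

# Toy training data for the fallback decision tree.
cells = ["import pandas as pd", "plt.plot(history)", "df.dropna(inplace=True)"]
labels = ["setup", "visualization", "data_cleaning"]
tree = make_pipeline(CountVectorizer(token_pattern=r"[A-Za-z_]+"),
                     DecisionTreeClassifier(random_state=0))
tree.fit(cells, labels)

def classify(cell: str) -> str:
    return rule_based(cell) or tree.predict([cell])[0]

print(classify("model.fit(X_train, y_train)"))  # fires the ".fit(" rule: "training"
print(classify("import numpy as np"))           # no rule fires; tree decides
```

The point of the split is that rules cover the cheap, high-precision cases, while the learned classifier absorbs library changes without the look-up table maintenance the abstract criticizes.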
Related papers
- LML-DAP: Language Model Learning a Dataset for Data-Augmented Prediction [0.0]
This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks in an explainable way.
The proposed method uses the words "Act as an Explainable Machine Learning Model" in the prompt to enhance the interpretability of the predictions.
In some test cases, the system scored an accuracy above 90%, demonstrating its effectiveness.
arXiv Detail & Related papers (2024-09-27T17:58:50Z) - Typhon: Automatic Recommendation of Relevant Code Cells in Jupyter Notebooks [0.3122672716129843]
This paper proposes Typhon, an approach to automatically recommend relevant code cells in Jupyter notebooks.
Typhon tokenizes developers' markdown description cells and looks for the most similar code cells from the database.
We evaluated the Typhon tool on Jupyter notebooks from Kaggle competitions and found that the approach can recommend code cells with moderate accuracy.
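The tokenize-and-match step can be sketched as below; TF-IDF cosine similarity is an assumed stand-in for whatever similarity Typhon actually computes, and the code database is invented for illustration.

```python
# Minimal sketch of markdown-to-code-cell recommendation: tokenize a markdown
# description cell and rank code cells from a database by cosine similarity.
# TF-IDF is a stand-in similarity measure, not necessarily Typhon's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

code_db = [  # toy database of previously seen code cells
    "df = pd.read_csv('train.csv')",
    "model = RandomForestClassifier().fit(X_train, y_train)",
    "plt.hist(df['age'], bins=30)",
]

def recommend(markdown_cell: str, top_k: int = 1):
    """Return the top_k most similar code cells for a markdown description."""
    vec = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
    matrix = vec.fit_transform(code_db + [markdown_cell])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [code_db[i] for i in sims.argsort()[::-1][:top_k]]

print(recommend("Load the training data with pandas read_csv"))
```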
arXiv Detail & Related papers (2024-05-15T03:59:59Z) - RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning [75.61681328968714]
We propose recurrent independent Grid LSTM (RigLSTM) to exploit the underlying modular structure of the target task.
Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve a better generalization ability.
arXiv Detail & Related papers (2023-11-03T07:40:06Z) - Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
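The boosting loop with a pluggable weak learner can be sketched as follows. In the paper the weak learner would be a prompted LLM; here a one-feature threshold stump stands in so the loop is runnable, and all names are assumptions of the sketch.

```python
# AdaBoost-style boosting with a pluggable weak learner. The paper's weak
# learner is a prompted LLM; a threshold stump substitutes for it here.
import math
import numpy as np

def stump(X, y, w):
    """Best weighted threshold classifier h(x) -> {-1, +1} and its error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= t, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    err, j, t, sign = best
    return (lambda Z: np.where(Z[:, j] >= t, sign, -sign)), err

def boost(X, y, rounds=5, learner=stump):
    """Weighted-majority ensemble; `learner` could query an LLM instead."""
    w = np.full(len(y), 1 / len(y))
    ensemble = []
    for _ in range(rounds):
        h, err = learner(X, y, w)
        err = max(err, 1e-10)                     # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # learner weight
        w = w * np.exp(-alpha * y * h(X))         # upweight mistakes
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in ensemble))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
predict = boost(X, y)
print(predict(X))  # recovers the labels on this separable toy set
```

Swapping `stump` for a function that prompts an LLM on the weighted samples is the paper's contribution; the surrounding boosting machinery is unchanged.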
arXiv Detail & Related papers (2023-06-25T02:39:19Z) - Static Analysis Driven Enhancements for Comprehension in Machine Learning Notebooks [7.142786325863891]
Jupyter notebooks enable developers to interleave code snippets with rich-text and in-line visualizations.
Recent studies have demonstrated that a large portion of Jupyter notebooks are undocumented and lack a narrative structure.
This paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers.
arXiv Detail & Related papers (2023-01-11T11:57:52Z) - End-to-End Learning to Index and Search in Large Output Spaces [95.16066833532396]
Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
arXiv Detail & Related papers (2022-10-16T01:34:17Z) - Class-Incremental Lifelong Learning in Multi-Label Classification [3.711485819097916]
This paper studies Lifelong Multi-Label (LML) classification, which builds an online class-incremental classifier in a sequential multi-label classification data stream.
To solve the problem, the study proposes an Augmented Graph Convolutional Network (AGCN) with a built Augmented Correlation Matrix (ACM) across sequential partial-label tasks.
arXiv Detail & Related papers (2022-07-16T05:14:07Z) - Few-Shot Class-Incremental Learning by Sampling Multi-Phase Tasks [59.12108527904171]
A model should recognize new classes and maintain discriminability over old classes.
The task of recognizing few-shot new classes without forgetting old classes is called few-shot class-incremental learning (FSCIL).
We propose a new paradigm for FSCIL based on meta-learning by LearnIng Multi-phase Incremental Tasks (LIMIT).
arXiv Detail & Related papers (2022-03-31T13:46:41Z) - Context-aware Execution Migration Tool for Data Science Jupyter Notebooks on Hybrid Clouds [0.22908242575265025]
This paper presents a solution developed as a Jupyter extension that automatically selects which cells, as well as in which scenarios, such cells should be migrated to a more suitable platform for execution.
Using notebooks from Earth science (remote sensing), image recognition, and handwritten digit identification (machine learning), our experiments show notebook state reductions of up to 55x and migration decisions that lead to performance gains of up to 3.25x when user interactivity with the notebook is taken into consideration.
arXiv Detail & Related papers (2021-07-01T02:33:18Z) - SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
We then train a student model on both labels and pseudo labels to generate final feature embeddings.
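The teacher/pseudo-label/student loop can be sketched as follows. SLADE learns distance-metric embeddings; plain classifiers and synthetic data stand in here (all assumptions of the sketch) to keep the three steps runnable.

```python
# Minimal self-training sketch: teacher on labeled data, pseudo-labels for the
# unlabeled pool, student on the union. Classifiers substitute for SLADE's
# embedding models; the data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 2)) + np.repeat([[0, 0], [4, 4]], 20, axis=0)
y_lab = np.repeat([0, 1], 20)
X_unl = rng.normal(size=(200, 2)) + rng.choice([[0, 0], [4, 4]], size=200)

teacher = LogisticRegression().fit(X_lab, y_lab)   # step 1: train teacher
pseudo = teacher.predict(X_unl)                    # step 2: pseudo-label

X_all = np.vstack([X_lab, X_unl])                  # step 3: student on
y_all = np.concatenate([y_lab, pseudo])            # labels + pseudo-labels
student = LogisticRegression().fit(X_all, y_all)

print(student.score(X_lab, y_lab))
```

The student benefits because the pseudo-labeled pool enlarges the training set wherever the teacher is already confident; SLADE additionally filters noisy pseudo-labels, which this sketch omits.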
arXiv Detail & Related papers (2020-11-20T08:26:10Z) - Split and Expand: An inference-time improvement for Weakly Supervised Cell Instance Segmentation [71.50526869670716]
We propose a two-step post-processing procedure, Split and Expand, to improve the conversion of segmentation maps to instances.
In the Split step, we split clumps of cells from the segmentation map into individual cell instances with the guidance of cell-center predictions.
In the Expand step, we find missing small cells using the cell-center predictions.
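The two steps can be sketched on arrays as follows; the nearest-center assignment and single-pixel seeding are simplifying assumptions, not the paper's exact procedure.

```python
# Hedged sketch of Split and Expand post-processing. Split assigns each
# foreground pixel of a clump to its nearest predicted cell center; Expand
# seeds an instance at any center the segmentation map missed entirely.
import numpy as np

def split_and_expand(seg, centers):
    """seg: HxW bool map; centers: list of (row, col). Returns HxW int labels."""
    inst = np.zeros(seg.shape, dtype=int)
    ys, xs = np.nonzero(seg)
    pts = np.stack([ys, xs], axis=1)
    ctr = np.array(centers)
    # Split: nearest-center assignment inside the foreground.
    if len(pts) and len(ctr):
        d = ((pts[:, None, :] - ctr[None, :, :]) ** 2).sum(-1)
        inst[ys, xs] = d.argmin(1) + 1
    # Expand: seed missing small cells at centers outside the foreground.
    for i, (r, c) in enumerate(centers, start=1):
        if not seg[r, c]:
            inst[r, c] = i
    return inst

seg = np.zeros((5, 8), dtype=bool)
seg[1:4, 1:7] = True                          # one clump covering two cells
labels = split_and_expand(seg, centers=[(2, 2), (2, 5)])
print(labels)                                  # clump split into instances 1 and 2
```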
arXiv Detail & Related papers (2020-07-21T14:05:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.