Transactional Python for Durable Machine Learning: Vision, Challenges,
and Feasibility
- URL: http://arxiv.org/abs/2305.08770v1
- Date: Mon, 15 May 2023 16:27:09 GMT
- Title: Transactional Python for Durable Machine Learning: Vision, Challenges,
and Feasibility
- Authors: Supawit Chockchowwat, Zhaoheng Li, Yongjoo Park
- Abstract summary: Python applications may lose important data, such as trained models and extracted features, due to machine failures or human errors.
This paper presents our vision of transactional Python that provides durability, atomicity, replicability, and time-versioning (DART) without any code modifications to user programs or the Python kernel.
Our evaluation of a proof-of-concept implementation with public PyTorch and scikit-learn applications shows that DART can be offered with overheads ranging from 1.5% to 15.6%.
- Score: 5.669983975369642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In machine learning (ML), Python serves as a convenient abstraction for
working with key libraries such as PyTorch, scikit-learn, and others. Unlike a
DBMS, however, Python applications may lose important data, such as trained
models and extracted features, due to machine failures or human errors, leading
to a waste of time and resources. Specifically, they lack four essential
properties that could make ML more reliable and user-friendly -- durability,
atomicity, replicability, and time-versioning (DART).
This paper presents our vision of Transactional Python that provides DART
without any code modifications to user programs or the Python kernel, by
non-intrusively monitoring application states at the object level and
determining a minimal amount of information sufficient to reconstruct a whole
application. Our evaluation of a proof-of-concept implementation with public
PyTorch and scikit-learn applications shows that DART can be offered with
overheads ranging from 1.5% to 15.6%.
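The object-level monitoring idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes that application state is a plain dictionary of picklable objects, that a change can be detected by hashing each object's pickled bytes, and that versions are kept in memory rather than on durable storage. The class name `DeltaCheckpointer` and its methods are hypothetical and chosen only for illustration.

```python
import hashlib
import pickle


class DeltaCheckpointer:
    """Hypothetical sketch: version a namespace by storing only changed objects.

    Assumptions (not from the paper): every object is picklable, a change is
    detected by comparing SHA-256 digests of pickled bytes, and deleted names
    are not tracked. Versions live in memory; a durable variant would write
    each delta to disk before acknowledging the commit.
    """

    def __init__(self):
        self._digests = {}   # name -> digest recorded at the last commit
        self.versions = []   # one {name: pickled bytes} delta per commit

    def commit(self, namespace):
        """Record the objects that changed since the last commit."""
        delta = {}
        for name, obj in namespace.items():
            blob = pickle.dumps(obj)
            digest = hashlib.sha256(blob).hexdigest()
            if self._digests.get(name) != digest:
                delta[name] = blob
                self._digests[name] = digest
        self.versions.append(delta)
        return sorted(delta)  # names saved in this version

    def restore(self, version):
        """Rebuild the full namespace as of the given version index."""
        state = {}
        for delta in self.versions[: version + 1]:
            for name, blob in delta.items():
                state[name] = pickle.loads(blob)
        return state


if __name__ == "__main__":
    checkpointer = DeltaCheckpointer()
    session = {"weights": [0.1, 0.2], "features": {"a": 1}}
    print(checkpointer.commit(session))   # both objects are new
    session["weights"] = [0.3, 0.4]       # simulate one training step
    print(checkpointer.commit(session))   # only the changed object is saved
    print(checkpointer.restore(0))        # state as of the first commit
```

Replaying deltas in commit order gives a simple form of time-versioning, and storing only changed objects is one way to keep the saved information small. Atomicity and crash-safe durability are deliberately left out of this sketch.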
Related papers
- forester: A Tree-Based AutoML Tool in R [0.0]
The forester is an open-source AutoML package implemented in R for training high-quality tree-based models.
It fully supports binary and multiclass classification, regression, and partially survival analysis tasks.
With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.
arXiv Detail & Related papers (2024-09-07T10:39:10Z)
- A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning [42.350737545269105]
We show how to run Python's scikit-learn, PyTorch, and OpenAI Gym libraries for building Machine Learning, Deep Learning, and Reinforcement Learning projects easily.
arXiv Detail & Related papers (2024-07-19T23:01:48Z) - pyvene: A Library for Understanding and Improving PyTorch Models via
Interventions [79.72930339711478]
$textbfpyvene$ is an open-source library that supports customizable interventions on a range of different PyTorch modules.
We show how $textbfpyvene$ provides a unified framework for performing interventions on neural models and sharing the intervened upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z) - SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot
Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z) - Repairing Bugs in Python Assignments Using Large Language Models [9.973714032271708]
We propose to use a large language model trained on code to build an automated program repair (APR) system for programming assignments.
Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking.
We evaluate MMAPR on 286 real student programs and compare to a baseline built by combining a state-of-the-art Python syntax repair engine, BIFI, and state-of-the-art Python semantic repair engine for student assignments, Refactory.
arXiv Detail & Related papers (2022-09-29T15:41:17Z) - problexity -- an open-source Python library for binary classification
problem complexity assessment [0.0]
The classification problem's complexity assessment is an essential element of many topics in the supervised learning domain.
The tools currently available to the academic community for calculating problem complexity measures exist only as C++ and R libraries.
This paper describes a software module that estimates 22 complexity measures in Python.
arXiv Detail & Related papers (2022-07-14T07:32:15Z) - PyGOD: A Python Library for Graph Outlier Detection [56.33769221859135]
PyGOD is an open-source library for detecting outliers in graph data.
It supports a wide array of leading graph-based methods for outlier detection.
PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).
arXiv Detail & Related papers (2022-04-26T06:15:21Z) - PyTorchVideo: A Deep Learning Library for Video Understanding [71.89124881732015]
PyTorchVideo is an open-source deep-learning library for video understanding tasks.
It covers a full stack of video understanding tools including multimodal data loading, transformations, and models.
The library is based on PyTorch and can be used by any training framework.
arXiv Detail & Related papers (2021-11-18T18:59:58Z) - Scikit-dimension: a Python package for intrinsic dimension estimation [58.8599521537]
This technical note introduces textttscikit-dimension, an open-source Python package for intrinsic dimension estimation.
textttscikit-dimension package provides a uniform implementation of most of the known ID estimators based on scikit-learn application programming interface.
We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation in real-life and synthetic data.
arXiv Detail & Related papers (2021-09-06T16:46:38Z) - DoubleML -- An Object-Oriented Implementation of Double Machine Learning
in Python [1.4911092205861822]
DoubleML is an open-source Python library implementing the double machine learning framework of Chernozhukov et al.
It contains functionalities for valid statistical inference on causal parameters when the estimation of parameters is based on machine learning methods.
The package is distributed under the MIT license and relies on core libraries from the scientific Python ecosystem.
arXiv Detail & Related papers (2021-04-07T16:16:39Z) - OPFython: A Python-Inspired Optimum-Path Forest Classifier [68.8204255655161]
This paper proposes a Python-based Optimum-Path Forest framework, denoted as OPFython.
As OPFython is a Python-based library, it provides a more friendly environment and a faster prototyping workspace than the C language.
arXiv Detail & Related papers (2020-01-28T15:46:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.