Machine Learning Systems are Bloated and Vulnerable
- URL: http://arxiv.org/abs/2212.09437v3
- Date: Thu, 25 Jan 2024 14:06:18 GMT
- Title: Machine Learning Systems are Bloated and Vulnerable
- Authors: Huaifeng Zhang, Fahmi Abdulqadir Ahmed, Dyako Fatih, Akayou Kitessa,
Mohannad Alhanahnah, Philipp Leitner, Ahmed Ali-Eldin
- Abstract summary: We develop MMLB, a framework for analyzing bloat in software systems.
MMLB measures the amount of bloat at both the container and package levels.
We show that bloat accounts for up to 80% of machine learning container sizes.
- Score: 2.7023370929727277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today's software is bloated with both code and features that are not used by
most users. This bloat is prevalent across the entire software stack, from
operating systems and applications to containers. Containers are lightweight
virtualization technologies used to package code and dependencies, providing
portable, reproducible, and isolated environments. Because of this ease of use, data
scientists often rely on machine learning containers to simplify their
workflows. However, this convenience comes at a cost: containers are often
bloated with unnecessary code and dependencies, resulting in very large sizes.
In this paper, we analyze and quantify bloat in machine learning containers. We
develop MMLB, a framework for analyzing bloat in software systems, focusing on
machine learning containers. MMLB measures the amount of bloat at both the
container and package levels, quantifying the sources of bloat. In addition,
MMLB integrates with vulnerability analysis tools and performs package
dependency analysis to evaluate the impact of bloat on container
vulnerabilities. Through experimentation with 15 machine learning containers
from TensorFlow, PyTorch, and Nvidia, we show that bloat accounts for up to 80%
of machine learning container sizes, increasing container provisioning times by
up to 370% and exacerbating vulnerabilities by up to 99%.
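As a rough illustration of the container-level measurement described above, the sketch below estimates bloat as the fraction of image bytes never touched by a representative workload. This is not MMLB's implementation; the unpacked rootfs path, the access-trace file, and its one-path-per-line format are all hypothetical assumptions.

```python
"""Minimal sketch of container-level bloat estimation (hypothetical, not MMLB):
compare the total size of files in an unpacked image with the size of files
actually accessed during a representative run."""
from pathlib import Path


def total_size(root: str) -> int:
    """Sum the sizes of all regular files under `root` (an unpacked image rootfs)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def accessed_size(root: str, trace_file: str) -> int:
    """Sum the sizes of files listed in an access trace (one path per line),
    e.g., collected with a file-access monitoring tool during a training run."""
    size = 0
    with open(trace_file) as f:
        for line in f:
            p = Path(root) / line.strip().lstrip("/")
            if p.is_file():
                size += p.stat().st_size
    return size


def bloat_ratio(root: str, trace_file: str) -> float:
    """Fraction of image bytes never touched by the traced workload."""
    total = total_size(root)
    used = accessed_size(root, trace_file)
    return 0.0 if total == 0 else 1.0 - used / total


if __name__ == "__main__":
    # Hypothetical inputs: an unpacked container rootfs and a file-access trace.
    ratio = bloat_ratio("/tmp/pytorch_rootfs", "/tmp/pytorch_access_trace.txt")
    print(f"Estimated bloat: {ratio:.1%} of container bytes unused")
```

A package-level variant would aggregate the same per-file sizes by owning package before computing the ratio.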
Related papers
- The Hidden Bloat in Machine Learning Systems [0.22099217573031676]
Software bloat refers to code and features that are not used by software at runtime.
For Machine Learning (ML) systems, bloat is a major contributor to technical debt, leading to decreased performance and wasted resources.
We present Negativa-ML, a novel tool to identify and remove bloat in ML frameworks by analyzing their shared libraries.
arXiv Detail & Related papers (2025-03-18T13:04:25Z) - Lightweight, Secure and Stateful Serverless Computing with PSL [43.025002382616066]
We present a Function-as-a-Service (FaaS) framework for Trusted Execution Environments (TEEs).
The framework provides rich programming language support on heterogeneous TEE hardware for statically compiled binaries and/or WebAssembly (WASM) bytecodes.
It achieves near-native execution speeds by utilizing the dynamic memory mapping capabilities of Intel SGX2.
arXiv Detail & Related papers (2024-10-25T23:17:56Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - An empirical study of bloated dependencies in CommonJS packages [6.115666382910127]
We conduct an empirical study of bloated dependencies, i.e., dependencies that are installed but entirely unused, within server-side applications.
We propose a trace-based dynamic analysis that monitors file access to determine which dependencies are never accessed at runtime (a minimal sketch of this idea appears after the related-papers list below).
Our findings suggest that native support for dependency debloating in package managers could significantly alleviate the burden of maintaining dependencies.
arXiv Detail & Related papers (2024-05-28T08:04:01Z) - LUCID: A Framework for Reducing False Positives and Inconsistencies Among Container Scanning Tools [0.0]
This paper presents a fully functional framework, LUCID, that reduces the false positives and inconsistencies produced by multiple container scanning tools.
Our results show that our framework can reduce inconsistencies by 70%.
We also create a Dynamic Classification component that can successfully classify and predict the different severity levels with an accuracy of 84%.
arXiv Detail & Related papers (2024-05-11T16:58:28Z) - Green AI: A Preliminary Empirical Study on Energy Consumption in DL
Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models in environments distinct from their native development settings.
This has led to the introduction of interchange formats such as ONNX, along with their runtime infrastructures, which serve as standard formats for deploying models across environments.
arXiv Detail & Related papers (2024-02-21T09:18:44Z) - Jup2Kub: algorithms and a system to translate a Jupyter Notebook
pipeline to a fault tolerant distributed Kubernetes deployment [0.9790236766474201]
Scientific notebooks facilitate computational, data-manipulation, and sometimes visualization steps for scientific data analysis.
Jupyter notebooks struggle to scale with larger data sets, lack failure tolerance, and depend heavily on the stability of underlying tools and packages.
Jup2Kub translates Jupyter notebook pipelines into a distributed, high-performance environment, enhancing fault tolerance.
arXiv Detail & Related papers (2023-11-21T02:54:06Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - In Situ Framework for Coupling Simulation and Machine Learning with
Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge.
In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z) - BLAFS: A Bloat Aware File System [2.3476033905954687]
We introduce BLAFS, a BLoat-Aware File System for containers.
BLAFS guarantees debloating safety for both cloud and edge systems.
arXiv Detail & Related papers (2023-05-08T11:41:30Z) - Learned-Database Systems Security [46.898983878921484]
We develop a framework for identifying vulnerabilities that stem from the use of machine learning (ML). We show that the use of ML can cause leakage of past queries in a database and enable a poisoning attack that causes exponential memory blowup, crashing the database in seconds. We find that adversarial ML is a universal threat against learned components in database systems.
arXiv Detail & Related papers (2022-12-20T15:09:30Z) - Opacus: User-Friendly Differential Privacy Library in PyTorch [54.8720687562153]
We introduce Opacus, a free, open-source PyTorch library for training deep learning models with differential privacy.
It provides a simple and user-friendly API, and enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code.
arXiv Detail & Related papers (2021-09-25T07:10:54Z) - TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems [5.188829601887422]
Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny embedded devices are omnipresent.
We introduce TensorFlow Lite Micro, an open-source ML inference framework for running deep-learning models on embedded systems.
arXiv Detail & Related papers (2020-10-17T00:44:30Z)
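For the trace-based dynamic analysis mentioned in the CommonJS entry above, a minimal sketch of the underlying idea follows: given a log of file paths accessed at runtime, flag installed dependencies that were never touched. The trace format, the node_modules layout, and all paths are illustrative assumptions, not the study's tooling.

```python
"""Hypothetical sketch of trace-based unused-dependency detection:
diff the set of installed packages against packages seen in a runtime
file-access trace."""
from pathlib import Path


def installed_packages(modules_dir: str) -> set[str]:
    """Names of packages installed under a node_modules-style directory
    (hidden entries such as .bin are skipped; scoped packages are simplified)."""
    return {p.name for p in Path(modules_dir).iterdir()
            if p.is_dir() and not p.name.startswith(".")}


def accessed_packages(trace_file: str, modules_dir: str) -> set[str]:
    """Package names appearing in the runtime file-access trace
    (one absolute file path per line)."""
    marker = Path(modules_dir).name  # e.g., "node_modules"
    used: set[str] = set()
    with open(trace_file) as f:
        for line in f:
            parts = Path(line.strip()).parts
            if marker in parts:
                idx = parts.index(marker)
                if idx + 1 < len(parts):
                    used.add(parts[idx + 1])
    return used


def unused_dependencies(modules_dir: str, trace_file: str) -> set[str]:
    """Dependencies installed but never accessed during the traced run."""
    return installed_packages(modules_dir) - accessed_packages(trace_file, modules_dir)


if __name__ == "__main__":
    # Hypothetical inputs: an installed dependency tree and an access trace.
    bloated = unused_dependencies("./node_modules", "./access_trace.txt")
    print(f"{len(bloated)} potentially unused dependencies:", sorted(bloated))
```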
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.