KML: Using Machine Learning to Improve Storage Systems
- URL: http://arxiv.org/abs/2111.11554v1
- Date: Mon, 22 Nov 2021 21:59:50 GMT
- Title: KML: Using Machine Learning to Improve Storage Systems
- Authors: Ibrahim Umit Akgun, Ali Selman Aydin, Aadil Shaikh, Lukas Velikov,
Andrew Burford, Michael McNeill, Michael Arkhangelskiy, and Erez Zadok
- Abstract summary: Machine learning techniques promise to learn patterns, generalize from them, and enable optimal solutions.
We develop a prototype KML architecture and apply it to two problems: optimal readahead and NFS read-size values.
Experiments show that KML consumes little OS resources, adds negligible latency, and yet can learn patterns that improve I/O throughput by as much as 2.3x or 15x for the two use cases, respectively.
- Score: 0.2810625954925814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Operating systems include many heuristic algorithms designed to improve
overall storage performance and throughput. Because such heuristics cannot work
well for all conditions and workloads, system designers resorted to exposing
numerous tunable parameters to users -- essentially burdening users with
continually optimizing their own storage systems and applications. Storage
systems are usually responsible for most latency in I/O heavy applications, so
even a small overall latency improvement can be significant. Machine learning
(ML) techniques promise to learn patterns, generalize from them, and enable
optimal solutions that adapt to changing workloads. We propose that ML
solutions become a first-class component in OSs and replace manual heuristics
to optimize storage systems dynamically. In this paper, we describe our
proposed ML architecture, called KML. We developed a prototype KML architecture
and applied it to two problems: optimal readahead and NFS read-size values. Our
experiments show that KML consumes little OS resources, adds negligible
latency, and yet can learn patterns that can improve I/O throughput by as much
as 2.3x or 15x for the two use cases respectively -- even for complex,
never-before-seen, concurrently running mixed workloads on different storage
devices.
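To picture the readahead use case concretely, below is a minimal user-space C sketch (not KML's kernel implementation or API): a small learned model maps features of recent I/O behavior to a readahead size instead of a fixed heuristic. A one-neuron logistic model stands in here for the neural network KML would actually train; all names, features, and weights are hypothetical.

```c
/* Hypothetical sketch: a learned model picks a readahead size from recent
 * I/O behavior, replacing a fixed heuristic.  Compile: cc -o ra ra.c -lm */
#include <math.h>
#include <stdio.h>

#define MIN_RA_KB 16     /* smallest readahead we would request */
#define MAX_RA_KB 1024   /* largest readahead we would request  */

struct features {
    double seq_ratio;    /* fraction of recent requests that were sequential */
    double avg_gap_kb;   /* mean offset gap (KB) between consecutive requests */
};

/* Hypothetical trained weights of a one-neuron logistic model. */
static const double W_SEQ = 6.0, W_GAP = -0.002, BIAS = -1.0;

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* Map a "how sequential is this workload" score to a readahead size. */
static int predict_readahead_kb(const struct features *f)
{
    double score = sigmoid(W_SEQ * f->seq_ratio + W_GAP * f->avg_gap_kb + BIAS);
    return MIN_RA_KB + (int)(score * (MAX_RA_KB - MIN_RA_KB));
}

int main(void)
{
    struct features sequential = { .seq_ratio = 0.95, .avg_gap_kb = 4.0 };
    struct features random_io  = { .seq_ratio = 0.10, .avg_gap_kb = 900.0 };

    printf("sequential workload -> readahead %d KB\n",
           predict_readahead_kb(&sequential));
    printf("random workload     -> readahead %d KB\n",
           predict_readahead_kb(&random_io));
    return 0;
}
```

In this sketch a mostly sequential access pattern yields a readahead near the maximum and a random pattern yields one near the minimum; KML's point is to perform this kind of inference inside the OS, cheaply and continuously, on features collected from live workloads.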
Related papers
- BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving [3.620158146761518]
BucketServe is a bucket-based dynamic batching framework designed to optimize inference performance. It can handle 1.93x more request load at an SLO attainment of 80% compared with UELLM and demonstrates 1.975x higher system load capacity than UELLM.
arXiv Detail & Related papers (2025-07-23T01:51:48Z) - SysLLMatic: Large Language Models are Software System Optimizers [2.4416377721219145]
We present SysLLMatic, a system that integrates Large Language Models with profiling-guided feedback and system performance insights. We evaluate it on three benchmark suites: HumanEval_Bench (competitive programming in C++), SciMark2 (scientific kernels in Java), and DaCapoBench (large-scale software systems in Java).
arXiv Detail & Related papers (2025-06-02T01:57:21Z) - Deep-Learning-Driven Prefetching for Far Memory [4.128884162772407]
We present FarSight, a Linux-based far-memory system that leverages deep learning (DL) to efficiently perform accurate data prefetching. Our evaluation of FarSight on four data-intensive workloads shows that it outperforms the state-of-the-art far-memory system by up to 3.6 times.
arXiv Detail & Related papers (2025-05-31T04:27:22Z) - Efficient Multi-modal Long Context Learning for Training-free Adaptation [96.21248144937627]
This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC). It embeds demonstration examples directly into the model input and condenses long-context multimodal inputs into compact, task-specific memory representations.
arXiv Detail & Related papers (2025-05-26T10:49:44Z) - MLZero: A Multi-Agent System for End-to-end Machine Learning Automation [48.716299953336346]
We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs). A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context. MLZero demonstrates superior performance on MLE-Bench Lite, outperforming all competitors in both success rate and solution quality.
arXiv Detail & Related papers (2025-05-20T05:20:53Z) - PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [13.678531084541666]
We propose PAPI, a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units.
PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous accelerator and a state-of-the-art PIM-only accelerator, respectively.
arXiv Detail & Related papers (2025-02-21T13:52:31Z) - PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [2.7309692684728613]
Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges.
We present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations.
arXiv Detail & Related papers (2025-01-14T15:14:10Z) - Efficiently Serving Large Multimodal Models Using EPD Disaggregation [24.05805398635414]
We introduce Encode-Prefill-Decode Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources.
We show substantial gains in memory efficiency (up to 15$\times$ less utilization), batch sizes (up to 22$\times$ larger), 10$\times$ more images/request, and 2.2$\times$ larger KV caches.
arXiv Detail & Related papers (2024-12-25T10:11:31Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Preble: Efficient Distributed Prompt Scheduling for LLM Serving [8.706905652975554]
This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing.
We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism.
Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
arXiv Detail & Related papers (2024-05-08T06:30:58Z) - PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance.
However, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
arXiv Detail & Related papers (2024-04-10T17:00:04Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - L2MAC: Large Language Model Automatic Computer for Extensive Code Generation [52.81694565226513]
Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture.
This paper presents L2MAC, the first practical LLM-based general-purpose stored-program automatic computer (von Neumann architecture) framework, for long and consistent output generation.
arXiv Detail & Related papers (2023-10-02T16:55:19Z) - Efficient Memory Management for Large Language Model Serving with
PagedAttention [44.70922552274376]
High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time.
Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
We propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.
arXiv Detail & Related papers (2023-09-12T12:50:04Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.