Related papers: CARMA: Collocation-Aware Resource Manager

URL: http://arxiv.org/abs/2508.19073v2
Date: Sat, 01 Nov 2025 16:13:11 GMT
Title: CARMA: Collocation-Aware Resource Manager
Authors: Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün
Abstract summary: Collocating multiple deep learning (DL) training tasks on the same GPU can improve utilization but introduces two key risks. We present CARMA, a task-level, collocation-aware resource management system for the server-scale.
Score: 5.998463702026698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a ~35% and ~15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.
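
The placement idea in the abstract (filter out high-risk GPUs, cap utilization, and use an estimated memory need to avoid OOMs) can be illustrated with a minimal sketch. This is not CARMA's implementation: the `GpuState` fields, the `pick_gpu` function, the two cap values, and the tie-breaking rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuState:
    """Snapshot of one GPU as a monitor might report it (hypothetical fields)."""
    gpu_id: int
    total_mem_mib: int
    used_mem_mib: int
    sm_util_pct: float  # streaming multiprocessor utilization, 0-100


def pick_gpu(
    gpus: list[GpuState],
    est_task_mem_mib: int,
    sm_util_cap_pct: float = 80.0,   # assumed cap, not a value from the paper
    mem_util_cap_pct: float = 90.0,  # assumed cap, not a value from the paper
) -> Optional[GpuState]:
    """Return a GPU that can host the task under both caps, or None."""
    candidates = []
    for gpu in gpus:
        # Risk filter: skip GPUs that are already busy, to limit interference.
        if gpu.sm_util_pct > sm_util_cap_pct:
            continue
        # OOM filter: skip GPUs where the estimated need would exceed the cap.
        projected_mem = gpu.used_mem_mib + est_task_mem_mib
        if projected_mem > gpu.total_mem_mib * mem_util_cap_pct / 100.0:
            continue
        candidates.append(gpu)
    if not candidates:
        return None  # caller may queue the task until a GPU frees up
    # Illustrative tie-break: prefer the least-utilized remaining GPU.
    return min(candidates, key=lambda g: (g.sm_util_pct, g.used_mem_mib))
```

In this sketch a task whose estimate fits no GPU is simply left unplaced. A recovery step like the one the abstract mentions would relaunch a task that still crashes with an OOM, for example after the estimate turns out to be too low.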

Related papers

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning [78.46301394559903]
Large Language Models (LLMs) are increasingly used for long-duration tasks. Current methods face a trade-off between cost and accuracy. MemSifter is a novel framework that offloads the memory retrieval process to a small-scale proxy model.
arXiv Detail & Related papers (2026-03-03T02:57:38Z)
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving [17.92164698813269]
Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. WarmServe improves TTFT by up to 50.8x compared to the state-of-the-art autoscaling-based system.
arXiv Detail & Related papers (2025-12-10T09:47:40Z)
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads [2.2991119948183525]
Estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing. We propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits.
arXiv Detail & Related papers (2025-10-23T23:16:27Z)
Semantic-Aware Scheduling for GPU Clusters with Large Language Models [60.14838697778884]
We propose SchedMate, a framework that bridges the semantic gap between schedulers and the jobs they manage. SchedMate extracts deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. We show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance.
arXiv Detail & Related papers (2025-10-02T02:01:02Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices [16.407669822378487]
SpecOffload embeds speculative decoding into offloading. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x.
arXiv Detail & Related papers (2025-05-15T13:10:31Z)
Accurate GPU Memory Prediction for Deep Learning Jobs through Dynamic Analysis [0.3867363075280544]
Out-of-Memory errors present a primary impediment to model training and efficient resource utilization. VeritasEst is an entirely CPU-based analysis tool capable of accurately predicting the peak GPU memory required for Deep Learning training tasks. Its performance was validated through thousands of experimental runs across convolutional neural network (CNN) models.
arXiv Detail & Related papers (2025-04-04T19:20:03Z)
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM. DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning. It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning. Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from a huge demand for computing resources. We build a unified framework for efficient end-to-end temporal action detection (ETAD). ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. We propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Our experiments show that workload-aware CPU and memory allocations can improve average JCT by up to 3.4x when compared to traditional GPU-proportional scheduling.
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multiple dynamic workloads system [9.86589655261934]
TENSILE is a method of managing GPU memory at tensor granularity to reduce the GPU memory peak. We implement TENSILE on our own deep learning framework and evaluate its performance.
arXiv Detail & Related papers (2021-05-27T17:46:16Z)
Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [46.20949184826173]
This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms. Non-maxima suppression and the subsequent feature selection, in particular, are prominent contributors to the overall image processing latency.
arXiv Detail & Related papers (2020-03-30T14:16:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.