GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at
the Edge
- URL: http://arxiv.org/abs/2201.07705v1
- Date: Wed, 19 Jan 2022 16:45:04 GMT
- Title: GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at
the Edge
- Authors: Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan,
Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, Ravi Netravali
- Abstract summary: We present model merging, a new memory management technique that exploits architectural similarities between edge vision models.
Experiments across diverse workloads reveal that GEMEL reduces memory usage by up to 60.7%, and improves overall accuracy by 8-39% relative to time/space sharing alone.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video analytics pipelines have steadily shifted to edge deployments to reduce
bandwidth overheads and privacy violations, but in doing so, face an
ever-growing resource tension. Most notably, edge-box GPUs lack the memory
needed to concurrently house the growing number of (increasingly complex)
models for real-time inference. Unfortunately, existing solutions that rely on
time/space sharing of GPU resources are insufficient as the required swapping
delays result in unacceptable frame drops and accuracy violations. We present
model merging, a new memory management technique that exploits architectural
similarities between edge vision models by judiciously sharing their layers
(including weights) to reduce workload memory costs and swapping delays. Our
system, GEMEL, efficiently integrates merging into existing pipelines by (1)
leveraging several guiding observations about per-model memory usage and
inter-layer dependencies to quickly identify fruitful and accuracy-preserving
merging configurations, and (2) altering edge inference schedules to maximize
merging benefits. Experiments across diverse workloads reveal that GEMEL
reduces memory usage by up to 60.7%, and improves overall accuracy by 8-39%
relative to time/space sharing alone.
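As a rough illustration of the merging idea, the sketch below shares the weight buffers of architecturally identical layers between two toy models and counts each distinct buffer once, as a GPU would. The layer shapes and the shape-equality sharing test are illustrative assumptions, not GEMEL's actual merging algorithm.

```python
import numpy as np

# Hypothetical sketch of GEMEL-style layer sharing: two "models" whose
# early layers have identical shapes share one weight buffer, so the
# merged workload stores each shared layer only once.
def make_model(shapes, shared=None):
    """Build a list of weight arrays; reuse buffers from `shared` where shapes match."""
    layers = []
    for i, shape in enumerate(shapes):
        if shared is not None and i < len(shared) and shared[i].shape == tuple(shape):
            layers.append(shared[i])          # share the existing buffer
        else:
            layers.append(np.zeros(shape, dtype=np.float32))
    return layers

def workload_bytes(models):
    """Total memory of a workload, counting each distinct buffer once."""
    seen, total = set(), 0
    for m in models:
        for w in m:
            if id(w) not in seen:
                seen.add(id(w))
                total += w.nbytes
    return total

shapes_a = [(64, 64), (64, 128), (128, 10)]   # detector head differs
shapes_b = [(64, 64), (64, 128), (128, 20)]   # classifier head differs

model_a = make_model(shapes_a)
unmerged_b = make_model(shapes_b)
merged_b = make_model(shapes_b, shared=model_a)  # share the first two layers

before = workload_bytes([model_a, unmerged_b])
after = workload_bytes([model_a, merged_b])
print(before, after)  # the merged workload stores each shared layer once
```

In this toy setup only the final (task-specific) layer of each model stays private, mirroring the observation that edge vision models often share backbone architectures while differing in their heads.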
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
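The linear-matching idea can be sketched as follows: with a positive feature map, softmax attention's n x n score matrix is replaced by a d x d summary, so memory no longer grows with the number of tokens. The feature map `phi` and all shapes here are illustrative assumptions, not LiVOS's actual design.

```python
import numpy as np

# Hypothetical sketch of linear-attention matching: the (n x n) score
# matrix is never materialized; phi(K)^T V is a (d x d_v) summary whose
# size is independent of sequence length n.
def phi(x):
    return np.maximum(x, 0) + 1e-6            # simple positive feature map (assumed)

def linear_attention(q, k, v):
    kv = phi(k).T @ v                          # (d, d_v) summary, independent of n
    z = phi(k).sum(axis=0)                     # per-feature normalizer, shape (d,)
    return (phi(q) @ kv) / (phi(q) @ z)[:, None]

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (32, 8)
```

By associativity, `phi(q) @ (phi(k).T @ v)` equals `(phi(q) @ phi(k).T) @ v`, so the output matches explicit pairwise matching while the peak memory depends only on the feature dimension.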
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
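The tiling idea can be sketched for an InfoNCE-style contrastive loss: the row-wise log-sum-exp over the N x N similarity matrix is accumulated over column tiles with a streaming, numerically stable update, so no array larger than N x tile is ever materialized. The tile size and embeddings are illustrative; this is not the paper's multi-level distributed scheme.

```python
import numpy as np

# Hypothetical sketch of a tile-based contrastive loss: accumulate the
# row-wise logsumexp over column tiles instead of building the full
# (n x n) similarity matrix.
def tiled_infonce(a, b, tile=4):
    n = a.shape[0]
    running_max = np.full(n, -np.inf)          # streaming logsumexp state
    running_sum = np.zeros(n)
    pos = np.einsum("nd,nd->n", a, b)          # positive-pair similarities
    for start in range(0, n, tile):
        block = a @ b[start:start + tile].T    # (n, tile) similarity tile
        new_max = np.maximum(running_max, block.max(axis=1))
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(block - new_max[:, None]).sum(axis=1))
        running_max = new_max
    lse = running_max + np.log(running_sum)
    return float(np.mean(lse - pos))

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((8, 16))

# reference: dense loss over the full similarity matrix
sim = a @ b.T
dense = float(np.mean(np.log(np.exp(sim).sum(axis=1)) - np.diag(sim)))
print(abs(tiled_infonce(a, b) - dense) < 1e-6)  # → True
```

The running-max rescaling is what keeps the tiled accumulation exact rather than approximate, which is why tiling trades memory for (a little) compute without changing the loss value.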
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
We present TPI-LLM, a compute- and memory-efficient tensor parallel inference system for 70B-scale models.
TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler.
We show that TPI-LLM achieves over 80% lower time-to-first-token and token latency than Accelerate.
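A sliding-window scheduler of this kind can be sketched as follows: only a fixed window of consecutive layers is resident at a time, with the oldest evicted as inference advances. The window size, layer count, and eviction policy here are illustrative assumptions, not TPI-LLM's actual scheduler.

```python
from collections import deque

# Hypothetical sketch of a sliding-window weight scheduler: at most
# `window` layers' weights are resident at once; peak memory is bounded
# by the window rather than by the full model depth.
def run_layers(n_layers, window=3):
    resident = deque()
    peak = 0
    for layer in range(n_layers):
        resident.append(layer)                 # "load" this layer's weights
        while len(resident) > window:
            resident.popleft()                 # evict the oldest layer
        peak = max(peak, len(resident))
        # ... run inference for `layer` here ...
    return peak

print(run_layers(12, window=3))  # → 3
```

In a real system the next layer's load would be overlapped with the current layer's compute, so the window hides disk latency as well as bounding memory.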
arXiv Detail & Related papers (2024-10-01T09:18:56Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management.
Our experiments show more than a 12x runtime improvement over the Hugging Face/DeepSpeed implementation with four GPUs, while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z)
- LR-CNN: Lightweight Row-centric Convolutional Neural Network Training for Memory Reduction
Convolutional Neural Networks with multi-layer architectures have advanced rapidly.
Current efforts mitigate this bottleneck either with external auxiliary solutions that add hardware cost, or with internal modifications that risk an accuracy penalty.
We break the traditional layer-by-layer (column) dataflow rule: operations are re-organized into rows that span all convolution layers.
This lightweight design allows a majority of intermediate data to be removed without any loss of accuracy.
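The row-centric dataflow can be sketched on a 1D analogue: instead of materializing the full intermediate output of layer 1 before running layer 2 (column order), each output chunk is pushed through both convolutions at once, so only a small slice of intermediate data ever exists. The two-layer chain, chunk size, and halo arithmetic are illustrative assumptions, not LR-CNN's actual implementation.

```python
import numpy as np

# Hypothetical sketch of row-centric dataflow for two stacked convolutions:
# each output chunk needs only a small input slice (the chunk plus a halo),
# so the full intermediate feature map is never stored.
def conv1d(x, w):
    return np.convolve(x, w, mode="valid")

def row_centric(x, w1, w2, chunk=8):
    halo = (len(w1) - 1) + (len(w2) - 1)      # extra input each chunk needs
    out_len = len(x) - halo
    out = np.empty(out_len)
    for s in range(0, out_len, chunk):
        e = min(s + chunk, out_len)
        tile = x[s:e + halo]                  # input slice for this chunk
        out[s:e] = conv1d(conv1d(tile, w1), w2)
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
full = conv1d(conv1d(x, w1), w2)              # layer-by-layer reference
print(np.allclose(row_centric(x, w1, w2), full))  # → True
```

Because each chunk's computation uses exactly the same values as the layer-by-layer path, the result is identical; only the lifetime of intermediate data changes.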
arXiv Detail & Related papers (2024-01-21T12:19:13Z)
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach runs in less than half the memory and achieves a speedup of up to 2.78x under severe memory constraints.
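A predictor-plus-search scheme of this kind can be sketched as follows: a simple memory model estimates peak usage for each (fuse, tile) choice, and the search returns the least-tiled configuration that fits the budget, since less tiling generally means less redundant compute. The cost model and its constants are toy assumptions, not MAFAT's actual predictor.

```python
# Hypothetical sketch of a MAFAT-style configuration search over fusing
# and tiling choices, driven by a toy memory-usage predictor.
def predict_peak_bytes(feature_bytes, tiles, fused):
    # assumed toy model: tiling divides the feature-map footprint; fusing
    # layers avoids storing the intermediate feature map between them
    overlap = 0 if fused else feature_bytes // 8
    return feature_bytes // tiles + overlap

def search_config(feature_bytes, budget, max_tiles=16):
    for fused in (True, False):                # prefer fused (cheaper) plans
        for tiles in range(1, max_tiles + 1):  # prefer fewer tiles
            if predict_peak_bytes(feature_bytes, tiles, fused) <= budget:
                return {"fused": fused, "tiles": tiles}
    return None                                # no configuration fits

cfg = search_config(feature_bytes=1 << 20, budget=1 << 18)
print(cfg)  # → {'fused': True, 'tiles': 4}
```

The nested loop stands in for the paper's search algorithm: the predictor prunes configurations without running them, which is what makes exploring the fuse/tile space cheap.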
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.