Related papers: Anatomizing Deep Learning Inference in Web Browsers

Anatomizing Deep Learning Inference in Web Browsers

URL: http://arxiv.org/abs/2402.05981v2
Date: Thu, 25 Jul 2024 13:37:16 GMT
Title: Anatomizing Deep Learning Inference in Web Browsers
Authors: Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu,
Abstract summary: We make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. In-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices.
Score: 17.63663828498732
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology

Related papers

WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers [4.817548755757474]
In-browser nearest neighbor search (ANNS) has become vital to modern AI infrastructure.<n>We propose WebANNS, a novel ANNS engine specifically designed for web browsers.
arXiv Detail & Related papers (2025-07-01T07:37:18Z)
Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
WebLLM: A High-Performance In-Browser LLM Inference Engine [9.771248136952039]
WebLLM is an open-source framework that enables high-performance LLM inference in web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications. Evaluations show that WebLLM can retain up to 80% native performance on the same device.
arXiv Detail & Related papers (2024-12-20T11:24:13Z)
DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [14.676716521856813]
Mixture-of-Experts (MoE) models face significant deployment challenges on memory-constrained devices. We presentP, an on-device MoE inference engine to optimize parallel GPU- CPU execution. P outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
arXiv Detail & Related papers (2024-12-16T07:59:21Z)
Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly. We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z)
An Efficient and Streaming Audio Visual Active Speaker Detection System [2.4515389321702132]
We present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference.
arXiv Detail & Related papers (2024-09-13T17:45:53Z)
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings. This led to the introduction of interchange formats such as ONNX, which includes its infrastructure, and ONNX, which work as standard formats.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations [30.477092899633785]
This paper presents the pioneering inbrowser inference system, nnJIT. nnJIT enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. Results show that nnJIT can achieve up to 8.2X faster within 30 seconds compared to the existing baselines.
arXiv Detail & Related papers (2023-09-16T12:29:25Z)
Native vs Web Apps: Comparing the Energy Consumption and Performance of Android Apps and their Web Counterparts [5.18539596100998]
We select 10 Internet content platforms across 5 categories. We measure them based on the energy consumption, network traffic volume, CPU load, memory load, and frame time of their native and Web versions.
arXiv Detail & Related papers (2023-08-31T13:51:56Z)
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z)
MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process. We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware. Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters. We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. Current low latency solutions trade-off on accuracy or fail to exploit the inherent temporal locality in prediction serving workloads. We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization [56.3111706960878]
High-end mobile platforms serve as primary computing devices for a wide range of Deep Neural Network (DNN) applications. constrained computation and storage resources on these devices pose significant challenges for real-time inference executions. We propose a set of hardware-friendly structured model pruning and compiler optimization techniques to accelerate DNN executions on mobile devices.
arXiv Detail & Related papers (2020-04-22T03:18:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.