ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
- URL: http://arxiv.org/abs/2506.17538v1
- Date: Sat, 21 Jun 2025 01:32:22 GMT
- Title: ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
- Authors: Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci
- Abstract summary: The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices.
- Score: 6.6246058403368595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
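To make the measurement loop concrete, the sketch below runs two simulated applications concurrently and reports each one's median latency and SLO attainment, the application-level metrics named in the abstract. This is a minimal stand-in written against the abstract, not ConsumerBench's actual API; the application names, workload functions, request counts, and SLO values are all placeholder assumptions.

```python
# A minimal sketch of the concurrent-benchmark idea: several "applications"
# issue requests at the same time on shared hardware, and we record each
# app's median latency and SLO attainment. Workloads and SLOs are placeholders.
import statistics
import threading
import time

def run_app(name, request_fn, num_requests, slo_seconds, results):
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        request_fn()  # stand-in for one local inference call
        latencies.append(time.perf_counter() - start)
    results[name] = {
        "p50_s": statistics.median(latencies),
        "slo_attainment": sum(l <= slo_seconds for l in latencies) / len(latencies),
    }

def fake_llm():        # placeholder for a local LLM request
    time.sleep(0.05)

def fake_diffusion():  # placeholder for a local image-generation request
    time.sleep(0.20)

results = {}
apps = [
    threading.Thread(target=run_app, args=("chatbot", fake_llm, 20, 0.10, results)),
    threading.Thread(target=run_app, args=("image_gen", fake_diffusion, 5, 0.50, results)),
]
for t in apps:  # both applications run concurrently, as on a shared device
    t.start()
for t in apps:
    t.join()
print(results)
```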
Related papers
- Agent WARPP: Workflow Adherence via Runtime Parallel Personalization [0.0]
Large language models (LLMs) are increasingly applied in task-oriented dialogue (TOD) systems. We present Workflow Adherence via Runtime Parallel Personalization (WARPP), a training-free, modular framework that combines multi-agent orchestration with runtime personalization. By dynamically pruning conditional branches based on user attributes, the framework reduces reasoning overhead and narrows tool selection at runtime (a toy sketch of this pruning idea appears after this list).
arXiv Detail & Related papers (2025-07-23T23:27:49Z) - Real-Time Semantic Segmentation of Aerial Images Using an Embedded U-Net: A Comparison of CPU, GPU, and FPGA Workflows [0.0]
This study introduces a lightweight U-Net model optimized for real-time semantic segmentation of aerial images. We maintain the accuracy of the U-Net on a real-world dataset while significantly reducing the model's parameters and Multiply-Accumulate (MAC) operations by a factor of 16.
arXiv Detail & Related papers (2025-03-07T08:33:28Z) - SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z) - Large Language Models for Power Scheduling: A User-Centric Approach [6.335540414370735]
We introduce a novel architecture for resource scheduling problems by converting an arbitrary user's voice request (VRQ) into a resource allocation vector.
Specifically, we design an LLM intent recognition agent that translates the request into an optimization problem (OP), an LLM OP parameter identification agent, and an OP solving agent (a non-LLM stub of this pipeline appears after this list).
arXiv Detail & Related papers (2024-06-29T15:47:28Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
The proposed adapter-ALBERT model is an efficient optimization that achieves maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Intelligent Model Update Strategy for Sequential Recommendation [34.02565495747133]
We introduce IntellectReq, designed to operate on the edge and evaluate the cost-benefit landscape of parameter requests with minimal communication overhead. We employ statistical mapping techniques to convert real-time user behavior into a normal distribution, using multi-sample outputs to quantify the model's uncertainty and thus its generalization capability (a compact sketch of this multi-sample test appears after this list).
arXiv Detail & Related papers (2023-02-14T20:44:12Z) - MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that, unlike MAPLE, which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes (a schematic counter-based regression appears after this list).
arXiv Detail & Related papers (2022-04-27T14:00:48Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Off-the-shelf cross-platform tensor frameworks and script-language engines do not by themselves supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using such basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum [55.6313942302582]
We propose a methodology to support the optimization of real-life applications on the Edge-to-Cloud Continuum.
Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour.
Our methodology can be generalized to other applications in the Edge-to-Cloud Continuum.
arXiv Detail & Related papers (2021-08-04T07:35:14Z) - Intelligent colocation of HPC workloads [0.0]
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized.
It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique is to colocate multiple applications on the same server.
We show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters (a toy pairing decision based on this idea appears after this list).
arXiv Detail & Related papers (2021-03-16T12:35:35Z)
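Returning to the WARPP entry above, the sketch below illustrates runtime branch pruning: workflow branches whose guard predicate fails for the current user's attributes are dropped before any reasoning happens, which also narrows the tool set the agent must consider. The workflow schema, attribute names, and tools are invented for illustration and are not WARPP's actual interfaces.

```python
# Toy runtime branch pruning: branches whose guard fails for this user's
# attributes are dropped up front, shrinking both the workflow and the
# tool set. Schema and attributes are invented.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Branch:
    name: str
    guard: Callable[[dict], bool]  # predicate over user attributes
    tools: list

workflow = [
    Branch("premium_refund", lambda u: u.get("tier") == "premium", ["refund_api"]),
    Branch("standard_refund", lambda u: u.get("tier") == "standard", ["ticket_api"]),
    Branch("intl_shipping", lambda u: u.get("region") != "US", ["customs_api"]),
]

def prune(workflow, user):
    # Keep only branches whose guard holds for this user.
    return [b for b in workflow if b.guard(user)]

user = {"tier": "premium", "region": "US"}
active = prune(workflow, user)
print([b.name for b in active])                      # ['premium_refund']
print(sorted({t for b in active for t in b.tools}))  # tools left to consider
```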
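For the power-scheduling entry, the stub below caricatures the VRQ-to-allocation pipeline without any LLM: a keyword check stands in for the LLM intent and parameter-identification agents, and a greedy loop stands in for the OP-solving agent. Every rule, field name, and number is an invented stand-in, not the paper's design.

```python
# Stub of the VRQ -> optimization problem -> allocation-vector pipeline.
def recognize_intent(vrq: str) -> dict:
    # Keyword stand-in for the LLM intent/parameter agents.
    urgent = "urgent" in vrq or "as soon as possible" in vrq
    return {"demand_kwh": 11.0, "deadline_h": 2 if urgent else 8}

def solve_allocation(op: dict, capacity_kw: float = 7.0) -> list:
    # Greedy stand-in for the OP-solving agent: deliver at full capacity
    # each hour until demand is met or the deadline passes.
    alloc, remaining = [], op["demand_kwh"]
    for _ in range(op["deadline_h"]):
        give = min(capacity_kw, remaining)
        alloc.append(give)
        remaining -= give
    return alloc  # kWh per hour: the resource allocation vector

op = recognize_intent("please charge my car, it is urgent")
print(solve_allocation(op))  # [7.0, 4.0]
```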
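For the IntellectReq entry, a compact sketch of the multi-sample uncertainty test: draw several stochastic outputs for the same input and use their disagreement to decide whether requesting fresh parameters is worthwhile. The stand-in model, noise level, and threshold are placeholders, not the paper's method.

```python
# Multi-sample uncertainty: high disagreement across stochastic outputs
# suggests poor generalization, so a parameter request is worthwhile.
import numpy as np

rng = np.random.default_rng(1)

def sample_scores(behavior, n_samples=8):
    # Stand-in for an on-device model with stochastic forward passes
    # (e.g., dropout kept on at inference).
    base = np.tanh(behavior.sum())
    return base + rng.normal(0, 0.3, size=n_samples)

def should_request_update(behavior, threshold=0.25):
    return sample_scores(behavior).std() > threshold

behavior = rng.uniform(size=16)  # synthetic real-time behavior features
print("request new parameters:", should_request_update(behavior))
```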
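For the MAPLE-Edge entry, a schematic version of counter-based latency prediction: featurize a run as a small vector of CPU performance counters and regress measured latency on it. The counters and data here are synthetic, and scikit-learn's random forest is merely one reasonable regressor choice; the real predictor's feature set and training pool differ.

```python
# Schematic counter-based latency prediction on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Columns: [instructions, cache_misses, branch_misses] per run (synthetic)
counters = rng.uniform(size=(200, 3))
# Pretend latency is driven by instructions and cache misses, plus noise
latency_ms = 5.0 * counters[:, 0] + 2.0 * counters[:, 1] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(counters[:150], latency_ms[:150])
pred = model.predict(counters[150:])
print("mean abs error (ms):", np.abs(pred - latency_ms[150:]).mean())
```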
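Finally, for the colocation entry, a toy pairing decision driven by a contention model: estimate the slowdown of each candidate pair from measured resource pressure, then choose the pairing with the least total degradation. The pressure values and the linear contention model are invented for illustration.

```python
# Toy colocation decision over four apps and one shared resource.
pressure = {"A": 0.7, "B": 0.2, "C": 0.5, "D": 0.3}  # solo pressure, measured

def degradation(x, y):
    # Slowdown appears once combined pressure exceeds what the resource supplies.
    return max(0.0, pressure[x] + pressure[y] - 1.0)

apps = sorted(pressure)
pairings = [  # all ways to split four apps into two colocated pairs
    ((apps[0], b), tuple(a for a in apps[1:] if a != b))
    for b in apps[1:]
]
best = min(pairings, key=lambda p: degradation(*p[0]) + degradation(*p[1]))
print("best colocation:", best)
```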