WebLLM: A High-Performance In-Browser LLM Inference Engine
- URL: http://arxiv.org/abs/2412.15803v1
- Date: Fri, 20 Dec 2024 11:24:13 GMT
- Title: WebLLM: A High-Performance In-Browser LLM Inference Engine
- Authors: Charlie F. Ruan, Yucheng Qin, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen
- Abstract summary: WebLLM is an open-source framework that enables high-performance LLM inference in web browsers.
WebLLM provides an OpenAI-style API for seamless integration into web applications.
Evaluations show that WebLLM can retain up to 80% of native performance on the same device.
- Abstract: Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
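To make the OpenAI-style API concrete, below is a minimal TypeScript sketch of in-browser usage. The @mlc-ai/web-llm package name, the CreateMLCEngine entry point, and the model ID follow the linked repository's README at the time of writing; treat them as assumptions, since package details may change between releases.

```typescript
// Minimal sketch of calling WebLLM's OpenAI-style API from a web app.
// Assumes the @mlc-ai/web-llm npm package and a prebuilt model ID
// ("Llama-3.1-8B-Instruct-q4f32_1-MLC" here); check the repository's
// model list for currently supported IDs.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Downloads weights and initializes WebGPU kernels on first use.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // Same request shape as the OpenAI chat completions API.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```

On first run the model weights are downloaded and cached by the browser, so subsequent page loads can initialize the engine from cache rather than from the network.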
Related papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z) - VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks.
Evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z) - Anatomizing Deep Learning Inference in Web Browsers [17.63663828498732]
We make the first comprehensive performance measurement of in-browser inference to date.
We propose new metrics for measuring in-browser inference: responsiveness, smoothness, and inference accuracy.
In-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices.
arXiv Detail & Related papers (2024-02-08T08:02:57Z) - Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations [30.477092899633785]
This paper presents the pioneering in-browser inference system, nnJIT.
nnJIT enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices.
Results show that nnJIT can achieve up to 8.2x speedups within 30 seconds compared to existing baselines.
arXiv Detail & Related papers (2023-09-16T12:29:25Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z) - WebSHAP: Towards Explaining Any Machine Learning Models Anywhere [13.883867498610172]
We present WebSHAP, the first in-browser tool that adapts the state-of-the-art model-agnostic explainability technique SHAP to the Web environment.
Our open-source tool is developed with modern Web technologies such as WebGL that leverage client-side hardware capabilities.
arXiv Detail & Related papers (2023-03-16T17:56:02Z) - Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High-throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide bounded latency for each request to meet consistent service-level objectives (SLOs).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z) - TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems [5.188829601887422]
Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny embedded devices are omnipresent.
We introduce TensorFlow Lite Micro, an open-source ML inference framework for running deep-learning models on embedded systems.
arXiv Detail & Related papers (2020-10-17T00:44:30Z)