WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers
- URL: http://arxiv.org/abs/2507.00521v2
- Date: Wed, 02 Jul 2025 02:20:54 GMT
- Title: WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers
- Authors: Mugeng Liu, Siqi Zhong, Qi Yang, Yudong Han, Xuanzhe Liu, Yun Ma
- Abstract summary: In-browser approximate nearest neighbor search (ANNS) has become vital to modern AI infrastructure. We propose WebANNS, a novel ANNS engine specifically designed for web browsers.
- Score: 4.817548755757474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approximate nearest neighbor search (ANNS) has become vital to modern AI infrastructure, particularly in retrieval-augmented generation (RAG) applications. Numerous in-browser ANNS engines have emerged to seamlessly integrate with popular LLM-based web applications, while addressing privacy protection and challenges of heterogeneous device deployments. However, web browsers present unique challenges for ANNS, including computational limitations, external storage access issues, and memory utilization constraints, which state-of-the-art (SOTA) solutions fail to address comprehensively. We propose WebANNS, a novel ANNS engine specifically designed for web browsers. WebANNS leverages WebAssembly to overcome computational bottlenecks, designs a lazy loading strategy to optimize data retrieval from external storage, and applies a heuristic approach to reduce memory usage. Experiments show that WebANNS is fast and memory efficient, achieving up to $743.8\times$ improvement in 99th percentile query latency over the SOTA engine, while reducing memory usage by up to 39\%. Note that WebANNS decreases query time from 10 seconds to the 10-millisecond range in browsers, making in-browser ANNS practical with user-acceptable latency.
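As a rough illustration of the lazy loading idea described in the abstract (not WebANNS's actual API), the sketch below shows how an in-browser, graph-based ANNS index might keep its nodes in external storage such as IndexedDB and pull them into a bounded in-memory cache only when a query touches them. All names (GraphNode, NodeLoader, LazyNodeCache, greedySearch) are hypothetical placeholders.

```typescript
// Minimal sketch of lazy loading for an in-browser ANNS index (hypothetical API).
// Graph nodes (vectors + neighbor lists) stay in external storage (e.g. IndexedDB)
// and are pulled into a bounded in-memory cache only when the search touches them.

interface GraphNode {
  id: number;
  vector: Float32Array;
  neighbors: number[];
}

// Any async source of nodes: IndexedDB, OPFS, or a network shard.
type NodeLoader = (id: number) => Promise<GraphNode>;

class LazyNodeCache {
  private cache = new Map<number, GraphNode>(); // insertion order ~ recency order
  constructor(private load: NodeLoader, private capacity: number) {}

  async get(id: number): Promise<GraphNode> {
    const hit = this.cache.get(id);
    if (hit) {
      // Refresh recency by re-inserting the entry.
      this.cache.delete(id);
      this.cache.set(id, hit);
      return hit;
    }
    const node = await this.load(id); // lazy fetch from external storage
    this.cache.set(id, node);
    if (this.cache.size > this.capacity) {
      // Evict the least recently used node to respect the memory budget.
      const oldest = this.cache.keys().next().value as number;
      this.cache.delete(oldest);
    }
    return node;
  }
}

// Greedy best-first walk over the graph, fetching nodes on demand.
async function greedySearch(
  entryId: number,
  query: Float32Array,
  cache: LazyNodeCache,
  steps = 64,
): Promise<number> {
  const dist = (v: Float32Array) =>
    v.reduce((s, x, i) => s + (x - query[i]) ** 2, 0);
  let best = await cache.get(entryId);
  let bestDist = dist(best.vector);
  for (let i = 0; i < steps; i++) {
    let improved = false;
    for (const nid of best.neighbors) {
      const cand = await cache.get(nid);
      const d = dist(cand.vector);
      if (d < bestDist) { best = cand; bestDist = d; improved = true; }
    }
    if (!improved) break;
  }
  return best.id;
}
```

In this sketch, the cache capacity caps memory usage at the cost of occasional asynchronous loads, which is the general trade-off any lazy loading strategy for external storage has to navigate.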
Related papers
- LoRaConnect: Unlocking HTTP Potential on LoRa Backbones for Remote Areas and Ad-Hoc Networks [26.152275462641168]
We propose LoRaConnect to enable HTTP access over LoRa. LoRaWeb hardware tethers a WiFi hotspot to which client devices connect and access HTTP resources over LoRa. LoRaWeb achieves an average throughput of approximately 1.18 KB/s, with an access delay of only about 1.3 s for a 1.5 KB webpage.
arXiv Detail & Related papers (2025-01-05T07:41:53Z)
- Anatomizing Deep Learning Inference in Web Browsers [17.63663828498732]
We make the first comprehensive performance measurement of in-browser inference to date.
We propose new metrics for measuring in-browser inference: responsiveness, smoothness, and inference accuracy.
In-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices.
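For context, latency figures like these are typically collected by timing repeated inference calls in the page itself. The minimal sketch below shows one way to do so with performance.now(); the runInference callback, warm-up count, and percentile choices are placeholders, not the paper's methodology.

```typescript
// Hedged sketch: collect per-inference latency percentiles in a browser.
// runInference stands in for whatever in-browser engine is being profiled.

async function measureLatency(
  runInference: () => Promise<void>,
  warmup = 5,
  iterations = 50,
): Promise<{ p50: number; p99: number }> {
  for (let i = 0; i < warmup; i++) await runInference(); // exclude warm-up (JIT, shader compile)
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    await runInference();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  const pct = (p: number) =>
    samples[Math.min(samples.length - 1, Math.floor(p * samples.length))];
  return { p50: pct(0.5), p99: pct(0.99) };
}
```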
arXiv Detail & Related papers (2024-02-08T08:02:57Z)
- LitE-SNN: Designing Lightweight and Efficient Spiking Neural Network through Spatial-Temporal Compressive Network Search and Joint Optimization [48.41286573672824]
Spiking Neural Networks (SNNs) mimic the information-processing mechanisms of the human brain and are highly energy-efficient.
We propose a new approach named LitE-SNN that incorporates both spatial and temporal compression into the automated network design process.
arXiv Detail & Related papers (2024-01-26T05:23:11Z)
- Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Networks (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets: MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z)
- Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations [30.477092899633785]
This paper presents the pioneering in-browser inference system, nnJIT.
nnJIT enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices.
Results show that nnJIT can achieve up to 8.2x speedups within 30 seconds compared to existing baselines.
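The sketch below illustrates only the general flavor of runtime kernel selection, where several candidate implementations of an operator are micro-benchmarked on the target device and the fastest is kept. It is an assumption-laden illustration, not nnJIT's API or its kernel generator.

```typescript
// Hedged sketch: pick the fastest of several candidate kernels by timing them
// on the device itself. Candidate generation is out of scope here.

type Kernel = (a: Float32Array, b: Float32Array, out: Float32Array) => void;

function pickFastestKernel(candidates: Kernel[], size = 1 << 16): Kernel {
  const a = new Float32Array(size).fill(1);
  const b = new Float32Array(size).fill(2);
  const out = new Float32Array(size);
  let best = candidates[0];
  let bestTime = Infinity;
  for (const k of candidates) {
    const t0 = performance.now();
    for (let rep = 0; rep < 10; rep++) k(a, b, out); // micro-benchmark on-device
    const elapsed = performance.now() - t0;
    if (elapsed < bestTime) { bestTime = elapsed; best = k; }
  }
  return best;
}
```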
arXiv Detail & Related papers (2023-09-16T12:29:25Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
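As an illustration of the performance-counter idea only (not MAPLE-Edge's actual model), the sketch below predicts latency by nearest-neighbor lookup over a table of profiled devices, each described by a small vector of counter readings.

```typescript
// Hedged sketch: predict latency from CPU performance-counter features by
// nearest-neighbor lookup over previously profiled reference devices.

interface Profile { counters: number[]; latencyMs: number; }

function predictLatency(query: number[], profiles: Profile[]): number {
  const dist = (a: number[], b: number[]) =>
    a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);
  let best = profiles[0];
  for (const p of profiles) {
    if (dist(query, p.counters) < dist(query, best.counters)) best = p;
  }
  return best.latencyMs;
}
```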
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
Compared to its full-precision software counterpart, it reduces classification time by three orders of magnitude with a small 4.5% impact on accuracy.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- SmartDet: Context-Aware Dynamic Control of Edge Task Offloading for Mobile Object Detection [19.106380479438172]
Mobile devices increasingly rely on object detection (OD) through deep neural networks (DNNs) to perform critical tasks.
Low-complexity object tracking (OT) can be used with OD, where the latter is periodically applied to generate "fresh" references for tracking.
We propose parallel OT (at the mobile device) and OD (at the edge server) processes that are resilient to large OD latency.
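A minimal sketch of that detection/tracking split is given below: lightweight tracking runs locally on every frame, while detection is offloaded asynchronously and its possibly stale result refreshes the tracker's references when it arrives. Function names and types are placeholders, not SmartDet's implementation.

```typescript
// Hedged sketch of parallel tracking (local) and detection (offloaded to an edge
// server), tolerant of large detection latency.

interface Box { x: number; y: number; w: number; h: number; }

async function runPipeline(
  nextFrame: () => Promise<ImageData>,
  trackLocally: (frame: ImageData, refs: Box[]) => Box[],
  detectOnEdge: (frame: ImageData) => Promise<Box[]>,
  frames: number,
): Promise<void> {
  let refs: Box[] = [];
  let pendingDetection: Promise<Box[]> | null = null;
  for (let i = 0; i < frames; i++) {
    const frame = await nextFrame();
    // Kick off a new offloaded detection only when the previous one has finished.
    if (!pendingDetection) {
      pendingDetection = detectOnEdge(frame).then((boxes) => {
        refs = boxes;            // refresh "fresh" references, however late they arrive
        pendingDetection = null;
        return boxes;
      });
    }
    const tracked = trackLocally(frame, refs); // low-complexity per-frame tracking
    console.log(`frame ${i}: ${tracked.length} objects tracked`);
  }
}
```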
arXiv Detail & Related papers (2022-01-11T23:01:35Z)
- Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search [64.80878113422824]
We propose an automatic search framework that derives sparse super-resolution (SR) models with high image quality while satisfying the real-time inference requirement.
With the proposed framework, we are the first to achieve real-time SR inference (only tens of milliseconds per frame) at 720p resolution with competitive image quality.
arXiv Detail & Related papers (2021-08-18T06:47:31Z)
- Enabling Homomorphically Encrypted Inference for Large DNN Models [1.0679692136113117]
Homomorphic encryption (HE) enables inference using encrypted data but it incurs 100x--10,000x memory and runtime overheads.
Secure deep neural network (DNN) inference using HE is currently limited by computing and memory resources.
We explore the feasibility of leveraging hybrid memory systems comprised of DRAM and persistent memory.
arXiv Detail & Related papers (2021-03-30T07:53:34Z)
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction-serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
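The control flow of such a prediction cache can be sketched as follows. GATI's cache is learned, whereas this illustration keys a plain map on a coarsely quantized input, so treat it only as a picture of the fast-path/slow-path structure, not the paper's design.

```typescript
// Hedged sketch: serve a cached prediction when a cheap lookup hits, otherwise
// run the full model and populate the cache (exploiting temporal locality).

function quantizedKey(input: Float32Array, bins = 16): string {
  // Coarse quantization so near-identical inputs map to the same cache entry.
  return Array.from(input, (x) => Math.round(x * bins)).join(",");
}

async function cachedPredict(
  input: Float32Array,
  runModel: (x: Float32Array) => Promise<number[]>,
  cache: Map<string, number[]>,
): Promise<number[]> {
  const key = quantizedKey(input);
  const hit = cache.get(key);
  if (hit) return hit;                       // fast path: cache hit
  const prediction = await runModel(input);  // slow path: full DNN inference
  cache.set(key, prediction);
  return prediction;
}
```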
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach in order to obtain a reduced search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)
- Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization [56.3111706960878]
High-end mobile platforms serve as primary computing devices for a wide range of Deep Neural Network (DNN) applications.
However, constrained computation and storage resources on these devices pose significant challenges for real-time inference execution.
We propose a set of hardware-friendly structured model pruning and compiler optimization techniques to accelerate DNN executions on mobile devices.
arXiv Detail & Related papers (2020-04-22T03:18:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.