Quantization-Aware Collaborative Inference for Large Embodied AI Models
- URL: http://arxiv.org/abs/2602.13052v1
- Date: Fri, 13 Feb 2026 16:08:19 GMT
- Title: Quantization-Aware Collaborative Inference for Large Embodied AI Models
- Authors: Zhonghao Lyu, Ming Xiao, Mikael Skoglund, Merouane Debbah, H. Vincent Poor
- Abstract summary: Large artificial intelligence models (LAIMs) are increasingly regarded as a core intelligence engine for embodied AI applications. To address the challenges their parameter scale and computational demands pose for resource-limited embodied agents, we investigate quantization-aware collaborative inference (co-inference) for embodied AI systems.
- Score: 67.66340659245186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large artificial intelligence models (LAIMs) are increasingly regarded as a core intelligence engine for embodied AI applications. However, the massive parameter scale and computational demands of LAIMs pose significant challenges for resource-limited embodied agents. To address this issue, we investigate quantization-aware collaborative inference (co-inference) for embodied AI systems. First, we develop a tractable approximation for quantization-induced inference distortion. Based on this approximation, we derive lower and upper bounds on the quantization rate-inference distortion function, characterizing its dependence on LAIM statistics, including the quantization bit-width. Next, we formulate a joint quantization bit-width and computation frequency design problem under delay and energy constraints, aiming to minimize the distortion upper bound while ensuring tightness through the corresponding lower bound. Extensive evaluations validate the proposed distortion approximation, the derived rate-distortion bounds, and the effectiveness of the proposed joint design. In particular, simulations and real-world testbed experiments demonstrate that the joint design balances inference quality, latency, and energy consumption in edge embodied AI systems.
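To make the bit-width-versus-resource tradeoff concrete, below is a minimal, self-contained sketch, not the paper's derivation: it uses the classic high-rate result that b-bit uniform quantization of a zero-mean, variance-sigma^2 source incurs mean-squared distortion on the order of sigma^2 * 2^(-2b) as a stand-in distortion proxy, and grid-searches a toy joint (bit-width, CPU frequency) design under illustrative delay and energy budgets. The cycle-count model, the CMOS dynamic-energy model, and all constants are placeholder assumptions.

```python
import numpy as np

# Stand-in distortion model (generic high-rate scaling, not the paper's bound).
def distortion_proxy(sigma2: float, bits: int) -> float:
    return sigma2 * 2.0 ** (-2 * bits)

# Toy joint design: grid-search (bit-width b, CPU frequency f) to minimize the
# distortion proxy under illustrative delay and energy budgets. All constants
# below are made-up placeholders, not values from the paper.
def joint_design(sigma2=1.0, fp_bits=32, params=1e9, cycles_per_param=2.0,
                 kappa=1e-27, delay_budget=0.5, energy_budget=5.0):
    best = None
    for b in range(2, 17):                           # candidate bit-widths
        for f in np.linspace(0.5e9, 3.0e9, 26):      # candidate CPU freqs (Hz)
            cycles = params * cycles_per_param * b / fp_bits  # load shrinks with b
            delay = cycles / f                        # seconds
            energy = kappa * cycles * f ** 2          # dynamic-power energy (J)
            if delay > delay_budget or energy > energy_budget:
                continue
            d = distortion_proxy(sigma2, b)
            if best is None or d < best[0]:
                best = (d, b, f, delay, energy)
    return best  # (distortion, bits, frequency, delay, energy)

print(joint_design())
```

Under these toy budgets the search pushes toward the largest bit-width the delay and energy constraints permit, which is the same distortion/latency/energy tension the paper formalizes through its rate-distortion bounds and joint design.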
Related papers
- Tensor Network Assisted Distributed Variational Quantum Algorithm for Large Scale Combinatorial Optimization Problem [19.046113542182436]
We propose the Distributed Variational Quantum Algorithm (DVQA) for solving Combinatorial Optimization Problems (COPs). A key innovation of DVQA is its use of the truncated higher-order singular value decomposition to preserve inter-variable dependencies without relying on complex long-range entanglement. Empirically, DVQA achieves state-of-the-art performance in simulations and has been experimentally validated on the Wu Kong quantum computer for portfolio optimization.
arXiv Detail & Related papers (2026-01-20T13:31:02Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- Quantum-Classical Hybrid Quantized Neural Network [8.382617481718643]
We present a novel Quadratic Binary Optimization (QBO) model for quantized neural network training, enabling the use of arbitrary activation and loss functions. We employ the Quantum Conditional Gradient Descent (QCGD) algorithm, which leverages quantum computing to directly solve the QBO problem.
arXiv Detail & Related papers (2025-06-23T02:12:36Z)
- Deep Unfolding with Kernel-based Quantization in MIMO Detection [26.033613526407226]
This paper proposes a novel kernel-based adaptive quantization (KAQ) framework for deep unfolding networks. The proposed KAQ framework outperforms traditional methods in accuracy and successfully reduces the model's inference latency.
arXiv Detail & Related papers (2025-05-19T05:50:24Z)
- The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large artificial intelligence model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate an LAIM inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z)
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize MDE models with hardware acceleration for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
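For intuition only, here is a generic symmetric per-tensor 4-bit fake-quantization sketch; QuartDepth's actual calibration, activation handling, and ASIC mapping are more involved, and `quantize_int4` is a name invented for this illustration.

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Fake-quantize a tensor to signed 4-bit codes in [-8, 7] (generic sketch)."""
    scale = np.max(np.abs(x)) / 7.0 + 1e-12       # per-tensor scale factor
    q = np.clip(np.round(x / scale), -8, 7)       # integer codes
    return q * scale, q.astype(np.int8)           # dequantized tensor, raw codes

w = np.random.randn(256, 256).astype(np.float32)  # stand-in weight matrix
w_dq, w_q = quantize_int4(w)
print("quantization MSE:", float(np.mean((w - w_dq) ** 2)))
```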
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
- Nonlinearity of the Fidelity in Open Qudit Systems: Gate and Noise Dependence in High-dimensional Quantum Computing [0.0]
This paper investigates the Average Gate Fidelity (AGF) of single qudit systems under Markovian noise in the Lindblad formalism. We derive general expressions for the perturbative expansion of the Average Gate Infidelity (AGI) in terms of the environmental coupling coefficient. Our findings highlight the dependence of AGI on qudit dimensionality, quantum gate choice, and noise strength, providing critical insights for optimising quantum gate design and error correction protocols.
arXiv Detail & Related papers (2024-06-21T13:36:09Z)
- QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty. We propose to adjust these distributions through weight finetuning to be more quantization-friendly. Our method demonstrates its efficacy across three high-resolution image generation tasks.
arXiv Detail & Related papers (2024-02-06T03:39:44Z)
- Truncated Non-Uniform Quantization for Distributed SGD [17.30572818507568]
We introduce a novel two-stage quantization strategy to enhance the communication efficiency of distributed Stochastic Gradient Descent (SGD).
The proposed method initially employs truncation to mitigate the impact of long-tail noise, followed by a non-uniform quantization of the post-truncation gradients based on their statistical characteristics.
Our proposed algorithm outperforms existing quantization schemes, striking a superior balance between communication efficiency and convergence performance.
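A minimal sketch of this two-stage idea, assuming truncation at a multiple of the gradient standard deviation and quantile-based level placement for the non-uniform stage (the paper's actual threshold and level design may differ):

```python
import numpy as np

def truncated_nonuniform_quantize(g: np.ndarray, bits: int = 4, k: float = 3.0):
    """Stage 1: clip long tails; stage 2: non-uniform levels from statistics."""
    tau = k * np.std(g)                                   # truncation threshold
    g_t = np.clip(g, -tau, tau)                           # truncated gradients
    levels = np.quantile(g_t, np.linspace(0.0, 1.0, 2 ** bits))  # data-driven levels
    idx = np.argmin(np.abs(g_t[:, None] - levels[None, :]), axis=1)
    return levels[idx], idx.astype(np.uint8)              # reconstruction, codes

g = np.random.standard_t(df=3, size=10_000)               # heavy-tailed gradients
g_hat, codes = truncated_nonuniform_quantize(g)
print("quantization MSE:", float(np.mean((g - g_hat) ** 2)))
```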
arXiv Detail & Related papers (2024-02-02T05:59:48Z)
- Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
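As a hedged sketch of the constrained-learning idea, the toy loop below trains a logistic model while enforcing a bound on quantization error via dual ascent on a multiplier; the paper's exact problem, quantizer, and duality analysis are not reproduced here, and all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)           # synthetic labels

def quantize(v, bits=2):
    """Simple symmetric uniform quantizer (illustrative choice)."""
    s = np.max(np.abs(v)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.clip(np.round(v / s), -2 ** (bits - 1), 2 ** (bits - 1) - 1) * s

w, lam, eps = np.zeros(10), 0.0, 1e-2                     # weights, multiplier, budget
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))                      # logistic predictions
    grad_loss = X.T @ (p - y) / len(y)                    # cross-entropy gradient
    qerr_grad = 2.0 * (w - quantize(w))                   # treat quantizer as constant
    w -= 0.1 * (grad_loss + lam * qerr_grad)              # primal descent step
    lam = max(0.0, lam + 0.5 * (np.sum((w - quantize(w)) ** 2) - eps))  # dual ascent

print("final quantization error:", float(np.sum((w - quantize(w)) ** 2)))
```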
arXiv Detail & Related papers (2022-10-27T17:12:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.