Related papers: MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration

MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration

URL: http://arxiv.org/abs/2507.12935v1
Date: Thu, 17 Jul 2025 09:20:51 GMT
Title: MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration
Authors: Shirui Zhao, Jun Yin, Lingyun Yao, Martin Andraud, Wannes Meert, Marian Verhelst,
Abstract summary: textbfMC$2$A is an algorithm-hardware co-design framework for MCMC acceleration.<n>textbfMC$2$A achieves an overall $307.6times$, $1.4times$, $2.0times$, $84.2times$ speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator.
Score: 9.064931467874807
License: http://creativecommons.org/licenses/by/4.0/
Abstract: An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, \textbf{MC$^2$A} analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, \textbf{MC$^2$A} proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, \textbf{MC$^2$A} achieves an overall {$307.6\times$, $1.4\times$, $2.0\times$, $84.2\times$} speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.

Related papers

Fast Monte Carlo Tree Diffusion: 100x Speedup via Parallel Sparse Planning [61.694143925237206]
Recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search.<n>Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening.<n>Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance.
arXiv Detail & Related papers (2025-06-11T08:17:40Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
Combination of experts (MoE) model is a sparse variant of large language models (LLMs)<n>Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices.<n>We propose D$2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z)
Efficiently Vectorized MCMC on Modern Accelerators [1.952427698056566]
We show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like $textttvmap$ by using the framework of finite state machines (FSMs)<n>We implement several popular MCMC algorithms as FSMs, including Slice Sampling, HMC-NUTS and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude.
arXiv Detail & Related papers (2025-03-20T16:07:14Z)
AutoStep: Locally adaptive involutive MCMC [51.186543293659376]
We present a novel class of involutive MCMC methods -- AutoStep MCMC -- that selects an appropriate step size at each iteration adapted to the local geometry of the target distribution.<n>We show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
arXiv Detail & Related papers (2024-10-24T17:17:11Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.<n>Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Accelerating Look-ahead in Bayesian Optimization: Multilevel Monte Carlo is All you Need [5.283807323380133]
Multilevel Monte Carlo (MLCBOC) is capable of achieving the canonical MC convergence rate. Our theoretical study focuses on the approximation improvements for twoand three-step look-ahead acquisition functions. Our findings are verified numerically and the benefits ofCBOC for BO are illustrated on several benchmark examples.
arXiv Detail & Related papers (2024-02-03T10:24:30Z)
Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable. Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost. This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
arXiv Detail & Related papers (2023-08-22T19:09:56Z)
M3ICRO: Machine Learning-Enabled Compact Photonic Tensor Core based on PRogrammable Multi-Operand Multimode Interference [18.0155410476884]
Photonic tensor core (PTC) designs based on standard optical components hinder scalability and compute density due to their large spatial footprint. We propose an ultra-compact PTC using customized programmable multi-operand multimode interference (MOMMI) devices, named M3ICRO. M3ICRO achieves a 3.4-9.6x smaller footprint, 1.6-4.4x higher speed, 10.6-42x higher compute density, 3.7-12x higher system throughput, and superior noise robustness.
arXiv Detail & Related papers (2023-05-31T02:34:36Z)
Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices. Previous unstructured or structured weight pruning methods can hardly truly accelerate inference. We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
Involutive MCMC: a Unifying Framework [64.46316409766764]
We describe a wide range of MCMC algorithms in terms of iMCMC. We formulate a number of "tricks" which one can use as design principles for developing new MCMC algorithms. We demonstrate the latter with two examples where we transform known reversible MCMC algorithms into more efficient irreversible ones.
arXiv Detail & Related papers (2020-06-30T10:21:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.