LRD-MPC: Efficient MPC Inference through Low-rank Decomposition
- URL: http://arxiv.org/abs/2602.14397v1
- Date: Mon, 16 Feb 2026 02:11:38 GMT
- Title: LRD-MPC: Efficient MPC Inference through Low-rank Decomposition
- Authors: Tingting Tang, Yongqin Wang, Murali Annavaram
- Abstract summary: Secure Multi-party Computation enables untrusted parties to jointly compute a function without revealing their inputs. Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC. We propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones.
- Score: 11.1852308328843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Secure Multi-party Computation (MPC) enables untrusted parties to jointly compute a function without revealing their inputs. Its application to machine learning (ML) has gained significant attention, particularly for secure inference services deployed across multiple cloud virtual machines (VMs), where each VM acts as an MPC party. Model providers secret-share model weights, and users secret-share inputs, ensuring that each server operates only on random shares. While MPC provides strong cryptographic guarantees, it incurs substantial computational and communication overhead. Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC. To reduce this cost, we propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones. This decomposition, however, introduces two overheads. First, each matrix multiplication in MPC incurs a round of communication, so decomposing one matrix multiplication into two adds a communication round. Second, the added matrix multiplication requires an additional truncation step to maintain numerical precision; since truncation itself requires communication and computation, these overheads can offset the gains from decomposition. To address this, we introduce two complementary optimizations: truncation skipping and efficient linear layer concatenation. Truncation skipping removes the extra truncation induced by LRD, while linear layer concatenation pipelines operations to hide the additional communication round. Together, these techniques mitigate the main overheads of LRD in MPC and improve overall efficiency. Our approach is broadly applicable across MPC protocols. Experiments show up to 25% speedup in n-PC and 33% in 3-PC protocols over full-rank baselines, along with up to 52% GPU energy savings and 88% reduction in offline-phase latency.
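The core cost argument can be illustrated in the clear, outside of any MPC protocol: factoring a weight matrix W into two thin factors A and B replaces one large product with two smaller ones and cuts the multiply-accumulate count. The sketch below is a minimal numpy illustration of this low-rank decomposition idea, not the paper's secret-shared implementation; the rank, sizes, and variable names are illustrative assumptions.

```python
import numpy as np

def low_rank_factor(W, r):
    """Truncated SVD: W (m x n) ~= A @ B with A (m x r), B (r x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]  # fold singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
m, n, r = 512, 512, 64
# Construct an exactly rank-r weight matrix so the factorization is lossless here.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
x = rng.standard_normal(n)

A, B = low_rank_factor(W, r)

full_macs = m * n            # one large matrix-vector product: W @ x
lrd_macs = r * n + m * r     # two smaller products: B @ x, then A @ (B @ x)
print(full_macs, lrd_macs)   # 262144 vs 65536: 4x fewer MACs at rank 64

# The decomposed path reproduces the full-rank result on this rank-r matrix.
assert np.allclose(W @ x, A @ (B @ x))
```

In MPC, each of the two smaller products would also cost a communication round and (in fixed-point protocols) a truncation, which is exactly the overhead the paper's truncation skipping and layer concatenation are designed to hide.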
Related papers
- Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple [42.09057806159106]
General matrix multiplication is the cornerstone of Deep Learning and HPC workloads. Modern platforms with matrix multiplication accelerators exhibit a high FLOP/Byte machine balance. In this work we revisit space-filling curves (SFCs) to alleviate the problem of cumbersome tuning, obtaining platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit an inherently high degree of data locality.
arXiv Detail & Related papers (2026-01-22T19:56:16Z) - GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z) - NeuMatC: A General Neural Framework for Fast Parametric Matrix Operation [75.91285900600549]
We propose the Neural Matrix Computation Framework (NeuMatC), which tackles general parametric matrix operation tasks. NeuMatC learns, without supervision, a low-rank and continuous mapping from parameters to their corresponding matrix operation results. Experimental results on both synthetic and real-world datasets demonstrate the promising performance of NeuMatC.
arXiv Detail & Related papers (2025-11-28T07:21:17Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Breaking the Layer Barrier: Remodeling Private Transformer Inference with Hybrid CKKS and MPC [16.452180247201948]
This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively.
arXiv Detail & Related papers (2025-08-27T02:40:50Z) - Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing [67.98609858326951]
Intra-DP is a high-performance collaborative inference system optimized for deep neural networks (DNNs) on mobile devices. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-08T09:50:57Z) - Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. It consists of multiple waveguides equipped with numerous low-cost antennas, named pinching antennas (PAs). The positions of the PAs can be reconfigured, spanning both large-scale paths and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z) - ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models.
It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency.
Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme that cheaply predicts top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that, on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers nearly matches pretraining and downstream accuracy while speeding up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine Learning Inference [5.7203077366666015]
We show that the computation and communication steps can be carefully orchestrated to overlap.
We propose MPC-Pipe, an efficient MPC system for both training and inference of ML workloads.
arXiv Detail & Related papers (2022-09-27T19:16:26Z) - HD-cos Networks: Efficient Neural Architectures for Secure Multi-Party Computation [26.67099154998755]
Multi-party computation (MPC) is a branch of cryptography where multiple non-colluding parties execute a protocol to securely compute a function.
We study training and inference of neural networks under the MPC setup.
We show that both approaches enjoy strong theoretical motivation and efficient computation under the MPC setup.
arXiv Detail & Related papers (2021-10-28T21:15:11Z) - Straggler-aware Distributed Learning: Communication-Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers in a binary straggler/non-straggler manner.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.