Related papers: D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations

URL: http://arxiv.org/abs/2510.13147v1
Date: Wed, 15 Oct 2025 04:56:36 GMT
Title: D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations
Authors: Faraz Tahmasebi, Michael Pelluer, Hyoukjun Kwon
Abstract summary: In this work, we report that the input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt progressive decomposition algorithm, Lanczos algorithm, and design a co-accelerator architecture for the decomposition algorithm. Our accelerator, D-com, provides 22% end-to-end latency improvements compared to A100 GPU at the cost of small model quality degradation.
Score: 2.4698886064068555
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The computation and memory costs of large language models kept increasing over last decade, which reached over the scale of 1T parameters. To address the challenges from the large scale models, model compression techniques such as low-rank decomposition have been explored. Previous model decomposition works have focused on weight decomposition to avoid costly runtime decomposition, whose latency often significantly exceeds the benefits from decomposition (e.g., 38% more end-to-end latency when running Llama2-7b on A100 with 4K sequence length with activation decomposition compared to no decomposition). In this work, we debunk such observations and report that the input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt progressive decomposition algorithm, Lanczos algorithm, and design a co-accelerator architecture for the decomposition algorithm. To address the memory-boundness of the decomposition operation, we introduce a novel compute replication methodology that moves the operation toward compute-bound region, which enables 6.2x speedup in our evaluation. We also develop an output shape-preserving computation scheme that eliminates decomposition costs in consecutive layers. To compensate model quality loss from compression, we introduce a multi-track decomposition approach that separately handles outlier channels for high accuracy and low perplexity with minimal computational costs. Combined together, our accelerator, D-com, provides 22% end-to-end latency improvements compared to A100 GPU at the cost of small model quality degradation (e.g., 3% on AI2 Reasoning Challenge task).
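For readers unfamiliar with the Lanczos-style iteration the abstract mentions, the following is a minimal NumPy sketch of Golub-Kahan-Lanczos bidiagonalization applied to an activation-like matrix. It is an illustration only and not D-com's implementation: the function name `lanczos_bidiag`, the parameters `steps`, `tol`, and `seed`, the choice of starting vector, and the use of full reorthogonalization are all assumptions made for this example.

```python
import numpy as np

def lanczos_bidiag(x, steps, tol=1e-10, seed=0):
    """Hypothetical helper: Golub-Kahan-Lanczos bidiagonalization.

    Returns (U, B, V) with orthonormal columns in U and V and an upper
    bidiagonal B such that x @ V == U @ B, so x is approximated by
    U @ B @ V.T. Full reorthogonalization is used for numerical
    stability (an assumption, not taken from the paper).
    """
    m, n = x.shape
    rng = np.random.default_rng(seed)
    U = np.zeros((m, steps))
    V = np.zeros((n, steps))
    alphas, betas = [], []

    # Start inside the row space of x so no step is spent on the null space.
    v = x.T @ rng.standard_normal(m)
    v /= np.linalg.norm(v)
    u_prev, beta = np.zeros(m), 0.0

    for j in range(steps):
        V[:, j] = v
        u = x @ v - beta * u_prev
        u -= U[:, :j] @ (U[:, :j].T @ u)          # reorthogonalize against earlier u's
        alpha = np.linalg.norm(u)
        if alpha < tol:                           # breakdown: nothing new to add
            break
        u /= alpha
        U[:, j] = u
        alphas.append(alpha)

        w = x.T @ u - alpha * v
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)  # reorthogonalize against earlier v's
        beta = np.linalg.norm(w)
        betas.append(beta)
        if beta < tol:                            # row space exhausted
            break
        v = w / beta
        u_prev = u

    k = len(alphas)
    # alphas on the diagonal, betas on the superdiagonal
    B = np.diag(alphas) + np.diag(betas[:k - 1], 1)
    return U[:, :k], B, V[:, :k]
```

Calling `U, B, V = lanczos_bidiag(x, steps=r)` yields a rank-`r` factorization whose quality improves progressively with each added step, which is one reason an iterative method of this kind can be stopped early when a coarser approximation is acceptable.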

Related papers

RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs [5.782015253162346]
Residual binarization enables matmul-free inference by stacking binary layers. We propose RaBiT, a novel quantization framework that resolves coadaptation by algorithmically enforcing a residual hierarchy. RaBiT achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models.
arXiv Detail & Related papers (2026-02-05T06:41:11Z)
A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs [7.577235739757108]
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations.
arXiv Detail & Related papers (2025-11-21T10:55:44Z)
Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation [1.8999296421549172]
We introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The LPG generates skeletal position information that complements the input 2D joint positions. Our model achieves state-of-the-art performance with improved computational efficiency.
arXiv Detail & Related papers (2025-05-27T15:08:03Z)
The Iterative Chainlet Partitioning Algorithm for the Traveling Salesman Problem with Drone and Neural Acceleration [27.475353583459263]
We introduce the Iterative Chainlet Partitioning (ICP) algorithm and its neural acceleration for solving the Traveling Salesman Problem with Drone (TSP-D). ICP yields an average improvement of 2.6% in solution quality over the previous state-of-the-art algorithm while reducing computational time by 91.3%. Compared to ICP, NICP reduces the total computational time by 28.6%, while the objective function value increase is limited to 0.14%.
arXiv Detail & Related papers (2025-04-21T14:51:15Z)
Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference. We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy. Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions. Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data efficient and highly parallelizable operator learning approach with reduced memory requirement and better generalization. MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena. We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
arXiv Detail & Related papers (2023-09-29T20:18:52Z)
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions. We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD). We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
Fast and Robust Iterative Closest Point [32.42799285301607]
Iterative Closest Point (ICP) is a fundamental technique for rigid registration between two point sets. Recent work such as Sparse ICP achieves robustness via sparsity optimization at the cost of computational speed. We show that the classical point-to-point ICP can be treated as a majorization-minimization (MM) algorithm, and propose an Anderson acceleration approach to speed up its convergence.
arXiv Detail & Related papers (2020-07-15T11:32:53Z)
Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems. Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs. This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.