Reconfigurable Low-latency Memory System for Sparse Matricized Tensor
Times Khatri-Rao Product on FPGA
- URL: http://arxiv.org/abs/2109.08874v1
- Date: Sat, 18 Sep 2021 08:19:29 GMT
- Title: Reconfigurable Low-latency Memory System for Sparse Matricized Tensor
Times Khatri-Rao Product on FPGA
- Authors: Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna
- Abstract summary: Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most expensive kernels in tensor computations.
This paper focuses on a multi-faceted memory system, which exploits the spatial and temporal locality of the data structures of MTTKRP.
Our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.
- Score: 3.4870723728779565
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Tensor decomposition has become an essential tool in many applications in
various domains, including machine learning. Sparse Matricized Tensor Times
Khatri-Rao Product (MTTKRP) is one of the most computationally expensive
kernels in tensor computations. Despite having significant computational
parallelism, MTTKRP is a challenging kernel to optimize due to its irregular
memory access characteristics. This paper focuses on a multi-faceted memory
system, which exploits the spatial and temporal locality of the data structures
system, which explores the spatial and temporal locality of the data structures
of MTTKRP. Further, users can reconfigure our design depending on the behavior
of the compute units used in the FPGA accelerator. Our system efficiently
accesses all the MTTKRP data structures while reducing the total memory access
time, using a distributed cache and Direct Memory Access (DMA) subsystem.
Moreover, our work improves the memory access time by 3.5x compared with
commercial memory controller IPs. Also, our system shows 2x and 1.26x speedups
compared with cache-only and DMA-only memory systems, respectively.
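To make the kernel concrete, below is a minimal NumPy sketch of mode-1 sparse MTTKRP over a COO tensor. The function and variable names are ours, and the sketch illustrates the access pattern rather than the paper's FPGA design; note how every nonzero triggers gathers of factor-matrix rows, which are exactly the irregular accesses the proposed memory system targets.

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, num_rows):
    """Mode-1 MTTKRP for a sparse 3-way tensor in COO format.

    Computes M(i, r) = sum over nonzeros of X(i, j, k) * B(j, r) * C(k, r),
    i.e. M = X_(1) (C khatri-rao B), without forming any dense product.
    Illustrative names only; this is not the paper's FPGA implementation.
    """
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for (i, j, k), x in zip(indices, values):
        # Each nonzero gathers one row of B and one row of C: these
        # row gathers are the irregular accesses the paper targets.
        M[i, :] += x * B[j, :] * C[k, :]
    return M

# Tiny example: a 2x3x2 tensor with three nonzeros, rank R = 2.
idx = [(0, 1, 0), (1, 2, 1), (0, 0, 1)]
vals = [1.0, 2.0, 0.5]
B = np.arange(6, dtype=float).reshape(3, 2)   # J x R factor matrix
C = np.arange(4, dtype=float).reshape(2, 2)   # K x R factor matrix
print(mttkrp_coo(idx, vals, B, C, num_rows=2))
```

Because consecutive nonzeros rarely hit nearby rows of B and C, neither a pure cache nor a pure DMA path fits all of the kernel's data structures, which is consistent with the paper's mixed distributed-cache and DMA design.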
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
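As background for LiVOS's "linear matching via linear attention": a hedged sketch of non-causal linear attention, which replaces the N x N softmax attention matrix with kernelized feature maps so memory scales linearly in sequence length. This is our illustrative code, not the LiVOS implementation; the elu-plus-one feature map is one common choice.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention: O(N d^2) instead of O(N^2 d).

    Approximates softmax(Q K^T) V by phi(Q) (phi(K)^T V) with a
    row-wise normalizer, never materializing the N x N matrix.
    Illustrative only, not the LiVOS code.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                # (d, d_v) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)      # (N,) per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)

# Toy shapes: 5 queries/keys of dim 4, values of dim 3.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
print(linear_attention(Q, K, V).shape)  # (5, 3)
```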
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
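To illustrate what a lookup-table engine computes: a hedged NumPy sketch of the dequantize-then-multiply semantics of LUT-quantized matmul. The group layout, 4-bit codes, and per-group tables are our simplifying assumptions; the actual FLUTE kernel fuses the table lookups into the GEMM on GPU rather than materializing float weights.

```python
import numpy as np

def lut_dequant_matmul(codes, luts, x, group_size=128):
    """Matmul with lookup-table-quantized weights (illustrative only).

    `codes` holds small integer indices (e.g. 4-bit) per weight; each
    group of `group_size` weights along the input dim shares a lookup
    table mapping code -> float value.
    """
    out_dim, in_dim = codes.shape
    n_groups = in_dim // group_size
    W = np.empty((out_dim, in_dim))
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        # Gather float weights from this group's table.
        W[:, cols] = luts[g][codes[:, cols]]
    return x @ W.T

# 4-bit codes: 16-entry tables, 2 groups over an input dim of 256.
rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(8, 256))
luts = rng.normal(size=(2, 16))
x = rng.normal(size=(32, 256))   # batch of 32
print(lut_dequant_matmul(codes, luts, x).shape)  # (32, 8)
```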
- Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z)
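A minimal sketch of the partial-execution idea for the easy case of elementwise operators: each input slice flows through the whole operator chain before the next slice is read, so no full intermediate buffer is ever materialized. Names and the chunking scheme are invented for illustration; Pex generalizes this to real CNN operators on microcontrollers.

```python
import numpy as np

def partial_execute(x, ops, chunk=64):
    """Run a chain of elementwise ops without materializing any full
    intermediate buffer (toy stand-in for partial execution).

    Each chunk flows through the whole chain before the next chunk is
    read, so peak extra memory is O(chunk), not O(len(x)).
    """
    out = np.empty_like(x)
    for start in range(0, len(x), chunk):
        seg = x[start:start + chunk]
        for op in ops:            # whole operator chain on one slice
            seg = op(seg)
        out[start:start + chunk] = seg
    return out

x = np.linspace(-2, 2, 1000)
ops = [np.tanh, lambda s: s * 2.0, np.abs]
ref = np.abs(np.tanh(x) * 2.0)
assert np.allclose(partial_execute(x, ops), ref)
```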
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
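The alternation described above maps onto a classic proximal-gradient loop. Below is a hedged toy version with a linear forward model A and a soft-threshold standing in for the learned denoiser; GLEAM's actual contribution, greedy stage-by-stage training of the unrolled network, is not shown.

```python
import numpy as np

def soft_threshold(x, lam):
    """Stand-in regularizer (a learned denoiser in a real network)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def unrolled_recon(A, y, steps=200, eta=0.1, lam=0.01):
    """Toy unrolled loop: alternate a physics-based data-consistency
    gradient step with a regularization (denoising) step, as unrolled
    MRI networks do. A is a toy linear measurement operator."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x - eta * A.T @ (A @ x - y)   # data consistency with measurements
        x = soft_threshold(x, lam)        # regularization / denoising
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 50)) / np.sqrt(30)   # undersampled measurements
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
y = A @ x_true
# Inspect the recovered coefficients on the true support.
print(np.round(unrolled_recon(A, y), 2)[[3, 17, 42]])
```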
- Towards Programmable Memory Controller for Tensor Decomposition [2.6782615615913348]
Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the pivotal kernel in tensor decomposition algorithms.
This paper explores the opportunities, key challenges, and an approach for designing a custom memory controller on FPGA for MTTKRP.
arXiv Detail & Related papers (2022-07-17T21:32:05Z)
- Machine Learning Training on a Real Processing-in-Memory System [9.286176889576996]
Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound.
Memory-centric computing systems with processing-in-memory capabilities can alleviate this data movement bottleneck.
Our work is the first one to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
arXiv Detail & Related papers (2022-06-13T10:20:23Z)
- Memory-Guided Semantic Learning Network for Temporal Sentence Grounding [55.31041933103645]
We propose a memory-augmented network that learns and memorizes the rarely appeared content in TSG tasks.
MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module.
arXiv Detail & Related papers (2022-01-03T02:32:06Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
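A hedged 1-D toy of patch-by-patch inference: each output patch is computed from just its input patch plus a halo of k-1 samples, so peak live memory is one small window rather than the full feature map. MCUNetV2 applies this scheduling to the early 2-D CNN stages; names and sizes here are ours.

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid 1-D convolution (correlation) used as the toy operator."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def patchwise_conv1d(x, w, patch=16):
    """Compute a valid conv output patch by patch with a halo of k-1
    input samples, so only one small input window is live at a time
    (toy 1-D analogue of patch-based inference)."""
    k = len(w)
    n_out = len(x) - k + 1
    out = np.empty(n_out)
    for s in range(0, n_out, patch):
        e = min(s + patch, n_out)
        window = x[s:e + k - 1]          # patch plus halo
        out[s:e] = conv1d_valid(window, w)
    return out

x = np.random.default_rng(0).normal(size=100)
w = np.array([0.25, 0.5, 0.25])
assert np.allclose(patchwise_conv1d(x, w), conv1d_valid(x, w))
```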
- Programmable FPGA-based Memory Controller [9.013666207570749]
This paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources.
The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers.
We show improved overall memory access time up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.
arXiv Detail & Related papers (2021-08-21T23:53:12Z)
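As a behavioural illustration of supporting "cache-line accesses along with bulk memory transfers": a toy dispatch that routes short irregular requests through a cache path and long sequential ones through a DMA/bulk path. The threshold and names are invented for illustration; the real controller is hardware on the FPGA.

```python
# Toy model of the routing decision such a controller makes.
# Thresholds and names are invented, not from the paper.
CACHE_LINE = 64          # bytes
BULK_THRESHOLD = 4096    # bytes; assumed cutover point

def route_request(addr: int, size: int) -> str:
    if size >= BULK_THRESHOLD:
        return "dma"     # amortize setup cost over a long burst
    return "cache"       # exploit line-granularity reuse

requests = [(0x1000, CACHE_LINE), (0x8000, 1 << 20), (0x2040, 256)]
for addr, size in requests:
    print(hex(addr), size, "->", route_request(addr, size))
```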
- PIM-DRAM: Accelerating Machine Learning Workloads using Processing in Memory based on DRAM Technology [2.6168147530506958]
We propose a processing-in-memory (PIM) multiplication primitive to accelerate matrix vector operations in ML workloads.
We show that the proposed architecture, mapping, and data flow can provide up to 23x and 6.5x benefits over a GPU.
arXiv Detail & Related papers (2021-05-08T16:39:24Z)
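To sketch why a PIM multiplication primitive helps memory-bound matrix-vector work: partition the matrix rows across banks, compute partial products where the data lives, and move only small results out. This is a hedged software analogy with invented parameters, not the PIM-DRAM microarchitecture.

```python
import numpy as np

def pim_style_matvec(W, x, n_banks=8):
    """Toy model of a PIM matrix-vector primitive: rows of W are
    partitioned across memory banks, each bank computes its partial
    products next to its data, and only small per-bank results move
    out. Bank count and partitioning are illustrative only."""
    row_groups = np.array_split(np.arange(W.shape[0]), n_banks)
    partials = [W[rows] @ x for rows in row_groups]  # "in-bank" compute
    return np.concatenate(partials)

rng = np.random.default_rng(0)
W, x = rng.normal(size=(64, 32)), rng.normal(size=32)
assert np.allclose(pim_style_matvec(W, x), W @ x)
```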
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.