TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations
- URL: http://arxiv.org/abs/2506.22818v1
- Date: Sat, 28 Jun 2025 08:42:01 GMT
- Title: TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations
- Authors: Stanislav Sedukhin, Yoichi Tomioka, Kazuya Matsumoto, Yuichi Okuyama,
- Abstract summary: Multilinear transformations are key in high-performance computing (HPC) and artificial intelligence (AI) workloads. Scaling computation by enlarging the number of parallel processing units substantially increases energy consumption. TriADA is capable of performing a variety of trilinear transformations with hypercubic arithmetic complexity in a linear number of time-steps.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilinear transformations are key in high-performance computing (HPC) and artificial intelligence (AI) workloads, where data is represented as tensors. However, their high computational and memory demands, which grow with dimensionality, often slow down critical tasks. Moreover, scaling computation by enlarging the number of parallel processing units substantially increases energy consumption, limiting widespread adoption, especially for sparse data, which is common in HPC and AI applications. This paper introduces TriADA, a Trilinear Algorithm and isomorphic Device Architecture, to address these challenges with the following innovations: (1) a massively parallel, low-rank algorithm for computing a family of trilinear (3D) discrete orthogonal transformations (3D-DXTs), a special case of the more general 3-mode matrix-by-tensor multiplication (3D-GEMT); (2) a new outer-product-based GEMM kernel with decoupled streaming active memory, specially designed to accelerate the 3D-GEMT operation; (3) a fully distributed 3D network of mesh-interconnected processing elements (cells), isomorphic to the proposed algorithm, with coordinate-free, data-driven local processing activity that is independent of problem size; (4) an elastic sparse outer-product (ESOP) method that avoids unnecessary computing and communication operations with zero-valued operands, thereby enhancing energy efficiency, computational accuracy, and stability. TriADA can perform a variety of trilinear transformations with hypercubic arithmetic complexity in a linear number of time-steps. The massively parallel, scalable, and energy-efficient architecture of TriADA is ideal for accelerating multilinear tensor operations, which are the most demanding parts of AI and HPC workloads.
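As a rough illustration of the two core operations named in the abstract, the NumPy sketch below (illustrative only; function names and the zero-skip threshold `eps` are ours, not TriADA's) expresses 3D-GEMT as a single three-mode contraction and shows an outer-product GEMM loop that skips rank-1 updates with zero-valued operands, in the spirit of ESOP. The final check confirms that a separable 3D DCT-II, one member of the 3D-DXT family, is exactly the 3-mode product with the same orthogonal matrix applied along every mode.

```python
import numpy as np
from scipy.fft import dctn  # used only to check the result

def gemt_3d(X, A, B, C):
    """3-mode matrix-by-tensor multiplication (3D-GEMT):
    Y[i,j,k] = sum_{a,b,c} A[i,a] * B[j,b] * C[k,c] * X[a,b,c]."""
    return np.einsum('ia,jb,kc,abc->ijk', A, B, C, X)

def gemm_outer_product_zero_skip(A, B, eps=0.0):
    """Outer-product formulation of GEMM that skips rank-1 updates whose
    operands are all (near-)zero, in the spirit of TriADA's ESOP method."""
    m, K = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for k in range(K):  # one rank-1 (outer-product) update per time-step
        col, row = A[:, k], B[k, :]
        if np.all(np.abs(col) <= eps) or np.all(np.abs(row) <= eps):
            continue  # zero-valued operands: no compute, no communication
        C += np.outer(col, row)
    return C

# A 3D DCT-II (one member of the 3D-DXT family) is the special case of
# 3D-GEMT where the same orthogonal matrix is applied along every mode.
N = 8
k = np.arange(N)
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
D[0, :] /= np.sqrt(2.0)  # orthonormal scaling of the DC row
X = np.random.rand(N, N, N)
assert np.allclose(gemt_3d(X, D, D, D), dctn(X, type=2, norm='ortho'))
assert np.allclose(gemm_outer_product_zero_skip(D, D.T), D @ D.T)
```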
Related papers
- Geometric Operator Learning with Optimal Transport [77.16909146519227]
We propose integrating optimal transport (OT) into operator learning for partial differential equations (PDEs) on complex geometries. For 3D simulations focused on surfaces, our OT-based neural operator embeds the surface geometry into a 2D parameterized latent space. Experiments with Reynolds-averaged Navier-Stokes equations (RANS) on the ShapeNet-Car and DrivAerNet-Car datasets show that our method achieves better accuracy and also reduces computational expenses.
arXiv Detail & Related papers (2025-07-26T21:28:25Z)
- Fused3S: Fast Sparse Attention on Tensor Cores [3.6068301267188]
This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves $1.6$-$16.3\times$ and $1.5$-$14\times$ speedups over the state of the art on H100 and A30 GPUs.
arXiv Detail & Related papers (2025-05-12T22:09:05Z)
- Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data [10.327160288730125]
Higher-Order Transformers (HOT) are designed to process data with more than two axes, i.e., higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism. We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification.
arXiv Detail & Related papers (2024-12-04T00:10:47Z)
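The Kronecker-factorized attention idea in the summary above can be made concrete with a generic identity (a hedged sketch of the principle, not HOT's exact mechanism): applying a Kronecker product of small per-mode attention matrices to a flattened tensor equals applying each small matrix along its own mode, so the full $(n_1 n_2) \times (n_1 n_2)$ attention matrix never needs to be materialized.

```python
import numpy as np

n1, n2 = 32, 48                 # tensor axes; flattened length N = n1*n2
X = np.random.rand(n1, n2)

def rand_attn(n):
    """Random row-stochastic matrix, standing in for a softmax attention map."""
    S = np.exp(np.random.rand(n, n))
    return S / S.sum(axis=1, keepdims=True)

A1, A2 = rand_attn(n1), rand_attn(n2)

# Full attention over the flattened tensor: an (N x N) matrix, O(N^2) memory.
full = np.kron(A1, A2) @ X.reshape(-1)

# Kronecker-factorized form: two small per-mode products, never forming N x N.
factored = (A1 @ X @ A2.T).reshape(-1)

assert np.allclose(full, factored)
```

The factored form costs $O(n_1^2 n_2 + n_1 n_2^2)$ instead of $O((n_1 n_2)^2)$, which is the source of the claimed efficiency.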
- Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training [2.875838666718042]
We focus on parallel and distributed machine learning algorithm development, specifically for optimizing the data processing and pre-training of a set of 5 encoder-decoder LLMs.
We performed a fine-grained study to quantify the relationships between three ML methods, specifically exploring Microsoft DeepSpeed Zero Redundancy Optimizer (ZeRO) stages.
arXiv Detail & Related papers (2023-10-09T02:22:00Z)
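Since the summary above names DeepSpeed's ZeRO stages, a minimal configuration sketch may help; the dictionary below is illustrative (the batch size and fp16 settings are assumptions, not the paper's settings), and the `stage` field selects how much optimizer, gradient, and parameter state is sharded across workers.

```python
# Hypothetical ZeRO-stage selection via a DeepSpeed config dict; the numeric
# values are illustrative assumptions, not settings from the paper.
ds_config = {
    "train_batch_size": 256,          # assumption: global batch size
    "fp16": {"enabled": True},        # mixed precision, commonly paired with ZeRO
    "zero_optimization": {
        # stage 1: shard optimizer states; stage 2: also shard gradients;
        # stage 3: also shard the model parameters themselves.
        "stage": 2,
    },
}

# Typical usage (requires the deepspeed package and a constructed model):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```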
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the explicit computation of the covariance matrices involved.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
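As a generic linear-algebra illustration of the kind of saving claimed above (this is not the paper's algorithm, and the Bayesian-linear-model setup below is our assumption), a posterior mean can be obtained from one linear solve without ever forming or inverting a covariance matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1000                      # few observations, high dimension
Phi = rng.standard_normal((n, d))     # design matrix of one sensing task
y = rng.standard_normal(n)            # observations
alpha = np.ones(d)                    # per-coefficient prior precisions

A = Phi.T @ Phi + np.diag(alpha)      # d x d posterior precision matrix

# Explicit route: invert A to get the posterior covariance, then multiply.
mu_explicit = np.linalg.inv(A) @ (Phi.T @ y)

# Cheaper, more stable route: one linear solve, no explicit covariance;
# iterative solvers (e.g. conjugate gradients) push this further by never
# even forming A densely.
mu_solve = np.linalg.solve(A, Phi.T @ y)

assert np.allclose(mu_explicit, mu_solve)
```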
- ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations.
Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix.
On 16$\times$16 crossbars with 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera-operations per second per watt.
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
- Geometry-Informed Neural Operator for Large-Scale 3D PDEs [76.06115572844882]
We propose the geometry-informed neural operator (GINO) to learn the solution operator of large-scale partial differential equations.
We successfully trained GINO to predict the pressure on car surfaces using only five hundred data points.
arXiv Detail & Related papers (2023-09-01T16:59:21Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
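For intuition about batch-efficient ED (a generic sketch, not the paper's QR-based method), array frameworks accept stacked `(batch, M, M)` inputs, so thousands of small eigendecompositions run in one vectorized call instead of a Python loop.

```python
import numpy as np

batch, M = 4096, 16                       # e.g. many small covariance matrices
A = np.random.rand(batch, M, M)
A = (A + A.transpose(0, 2, 1)) / 2        # symmetrize so eigh applies

# One batched call: eigenvalues (batch, M) and eigenvectors (batch, M, M).
eigvals, eigvecs = np.linalg.eigh(A)

# Reconstruction check: V @ diag(w) @ V^T == A for every matrix in the batch.
recon = eigvecs @ (eigvals[..., None, :] * eigvecs.transpose(0, 2, 1))
assert np.allclose(recon, A)
```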
- Large Scale Distributed Linear Algebra With Tensor Processing Units [0.0]
We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers.
The matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size.
arXiv Detail & Related papers (2021-12-16T16:55:22Z)