Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
- URL: http://arxiv.org/abs/2602.01996v1
- Date: Mon, 02 Feb 2026 11:56:36 GMT
- Title: Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
- Authors: Theologos Anthimopoulos, Milad Kokhazadeh, Vasilios Kelefouras, Benjamin Himpel, Georgios Keramidas
- Abstract summary: Low-rank factorization (LRF) offers an effective approach to compressing fully connected layers. This paper introduces an end-to-end LRF design space exploration methodology and a specialized design tool for optimizing FC layers on RISC-V processors. On average, our TT-decomposed layers run 3x faster than IREE and 8x faster than Pluto on the same compressed model.
- Score: 1.37013665345905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have become indispensable in many real-life applications such as natural language processing and autonomous systems. However, deploying DNNs on resource-constrained devices, e.g., on RISC-V platforms, remains challenging due to the high computational and memory demands of fully connected (FC) layers, which dominate resource consumption. Low-rank factorization (LRF) offers an effective approach to compressing FC layers, but the vast design space of LRF solutions involves complex trade-offs among FLOPs, memory size, inference time, and accuracy, making the LRF process complex and time-consuming. This paper introduces an end-to-end LRF design space exploration methodology and a specialized design tool for optimizing FC layers on RISC-V processors. Using the Tensor Train Decomposition (TTD) offered by the TensorFlow T3F library, the proposed work prunes the LRF design space by excluding, first, inefficient decomposition shapes and, second, solutions with poor inference performance on RISC-V architectures. Compiler optimizations are then applied to enhance custom T3F layer performance, minimizing inference time and boosting computational efficiency. On average, our TT-decomposed layers run 3x faster than IREE and 8x faster than Pluto on the same compressed model. This work provides an efficient solution for deploying DNNs on edge and embedded devices powered by RISC-V architectures.
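To make the first pruning step in the abstract concrete, here is a minimal sketch of how candidate decomposition shapes for one FC layer could be enumerated and filtered by parameter count. It relies only on the standard TT-matrix layout, where core k has shape (r_k, m_k, n_k, r_{k+1}) and the boundary ranks are 1. The function names, the single uniform rank, and the "smaller than the dense layer" filter are assumptions made for illustration. This is not the paper's tool, and it does not model the second pruning step (measured inference performance on RISC-V).

```python
from itertools import product


def tt_matrix_params(in_factors, out_factors, ranks):
    """Parameter count of a TT-matrix whose core k has shape (r[k], m[k], n[k], r[k+1])."""
    if len(in_factors) != len(out_factors):
        raise ValueError("input and output shapes need the same number of factors")
    if len(ranks) != len(in_factors) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have d + 1 entries with boundary ranks equal to 1")
    return sum(
        ranks[k] * m * n * ranks[k + 1]
        for k, (m, n) in enumerate(zip(in_factors, out_factors))
    )


def factorizations(value, d):
    """Yield every ordered way to write `value` as a product of d integers >= 2."""
    if d == 1:
        if value >= 2:
            yield (value,)
        return
    for f in range(2, value // 2 + 1):
        if value % f == 0:
            for rest in factorizations(value // f, d - 1):
                yield (f,) + rest


def explore(in_dim, out_dim, d, rank, max_ratio=1.0):
    """Return sorted (params, in_factors, out_factors) candidates.

    Hypothetical filter: keep a shape only if its parameter count is below
    max_ratio times the dense layer's in_dim * out_dim parameters.
    """
    dense = in_dim * out_dim
    ranks = [1] + [rank] * (d - 1) + [1]  # assumption: one uniform inner rank
    kept = []
    for ins, outs in product(factorizations(in_dim, d), factorizations(out_dim, d)):
        params = tt_matrix_params(ins, outs, ranks)
        if params < max_ratio * dense:
            kept.append((params, ins, outs))
    return sorted(kept)
```

For instance, `tt_matrix_params([4, 4], [4, 4], [1, 2, 1])` counts 64 parameters, against 256 for the dense 16x16 layer. A real exploration would also vary the ranks per core and weigh accuracy and inference time, which this sketch leaves out.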
Related papers
- Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting [1.5313142881179707]
Reservoir Computing (RC) mitigates challenges by replacing backpropagation with fixed recurrent atemporal readout optimization. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes a large reservoir into a series of smaller, interconnected layers.
arXiv Detail & Related papers (2026-01-01T02:24:56Z)
- Unsupervised Learning based Element Resource Allocation for Reconfigurable Intelligent Surfaces in mmWave Network [4.564546073852808]
We formulate a joint optimization problem that optimizes the RIS phase configuration and resource allocation under an $\alpha$-fair scheduling framework. We propose a five-layer fully connected neural network (FNN) combined with a preprocessing technique to significantly reduce input dimensionality, lower computational complexity, and enhance scalability. The proposed system achieves better performance while reducing computational complexity, making it significantly more scalable than iterative optimization algorithms.
arXiv Detail & Related papers (2025-09-03T11:56:27Z)
- FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation via Kolmogorov-Arnold Enhanced Spatial Attention Networks [1.663204995903499]
FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation) is a new architecture that balances accuracy and speed by using a special method. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework, adaptive TiKAN integration, and multi-scale attention fusion. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement.
arXiv Detail & Related papers (2025-07-16T23:17:58Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse [56.384390765357004]
We propose an integrated federated split learning and hyperdimensional computing framework for emerging foundation models.
This novel approach reduces communication costs, computation load, and privacy risks, making it suitable for resource-constrained edge devices in the Metaverse.
arXiv Detail & Related papers (2024-08-26T17:03:14Z)
- Input Convex Lipschitz RNN: A Fast and Robust Approach for Engineering Tasks [14.835081385422653]
We introduce a novel network architecture, termed Input Convex Lipschitz Recurrent Neural Networks (ICLRNNs). This architecture seamlessly integrates the benefits of convexity and Lipschitz continuity, enabling fast and robust neural network-based modeling and optimization. It has been successfully applied to practical engineering scenarios, such as the modeling and control of chemical processes and real-world solar irradiance prediction for solar PV system planning.
arXiv Detail & Related papers (2024-01-15T06:26:53Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- CR-LSO: Convex Neural Architecture Optimization in the Latent Space of Graph Variational Autoencoder with Input Convex Neural Networks [6.026956571669411]
In neural architecture search (NAS) methods based on latent space optimization (LSO), a deep generative model is trained to embed discrete neural architectures into a continuous latent space. This paper develops a convexity architecture regularized space (CR-LSO) method, which aims to regularize the learning process of the space in order to obtain a convex performance mapping. Experimental results on three popular NAS benchmarks show that CR-LSO achieves competitive evaluation results in terms of both computational complexity and performance.
arXiv Detail & Related papers (2022-11-11T01:55:11Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly truly accelerate inference.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Learning to Solve the AC-OPF using Sensitivity-Informed Deep Neural Networks [52.32646357164739]
We propose a deep neural network (DNN) to solve the AC optimal power flow (AC-OPF) problem.
The proposed SIDNN is compatible with a broad range of OPF schemes.
It can be seamlessly integrated in other learning-to-OPF schemes.
arXiv Detail & Related papers (2021-03-27T00:45:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.