COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
- URL: http://arxiv.org/abs/2506.00424v2
- Date: Sun, 15 Jun 2025 01:10:00 GMT
- Title: COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
- Authors: Chamika Sudusinghe, Gerasimos Gerogiannis, Damitha Lenadora, Charles Block, Josep Torrellas, Charith Mendis
- Abstract summary: COGNATE is a novel framework that leverages inexpensive data samples from general-purpose hardware to train cost models. We demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47x (up to 5.46x) for SpMM and 1.39x (up to 4.22x) for SDDMM.
- Score: 6.884173899890476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47x (up to 5.46x) for SpMM and 1.39x (up to 4.22x) for SDDMM.
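The transfer-learning recipe described in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical illustration of that recipe, not COGNATE's actual architecture: a shared encoder is pretrained on plentiful CPU cost samples, then frozen while a small hardware-specific head is fine-tuned on the few available accelerator (simulator) samples. Feature names, layer sizes, and the frozen-encoder scheme are all assumptions for illustration.

```python
# Hedged sketch of pretrain-on-CPU / few-shot-fine-tune-on-accelerator,
# NOT COGNATE's published architecture.
import torch
import torch.nn as nn

class CostModel(nn.Module):
    def __init__(self, n_features: int = 32):
        super().__init__()
        # Shared encoder over sparse-input features (e.g., nnz, row-length
        # statistics), which the abstract notes are homogeneous across
        # hardware platforms.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Hardware-specific head, re-tuned per target platform to absorb
        # cross-platform heterogeneity.
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        return self.head(self.encoder(x))

def pretrain_on_cpu(model, cpu_x, cpu_y, epochs=100):
    # cpu_x: (N, n_features) program/input features; cpu_y: (N, 1) runtimes.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(model(cpu_x), cpu_y).backward()
        opt.step()

def finetune_on_accelerator(model, acc_x, acc_y, epochs=50):
    # Few-shot step: freeze the shared encoder and adapt only the head on
    # the handful of expensive simulator samples.
    for p in model.encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(model(acc_x), acc_y).backward()
        opt.step()
```

In this framing, the abstract's 5% figure corresponds to the accelerator fine-tuning set being roughly twenty times smaller than the dataset an accelerator-specific model would need.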
Related papers
- MiniCPM4: Ultra-Efficient LLMs on End Devices [124.73631357883228]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices.
We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
- NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities [2.141889595429907]
We present NeuroSim V1.5, which introduces key advances in modeling device- and circuit-level non-idealities.
NeuroSim V1.5 advances the design and validation of next-generation ACIM accelerators.
All NeuroSim versions are available open-source at https://github.com/neurosim/NeuroSim.
arXiv Detail & Related papers (2025-05-05T02:07:04Z)
- Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation [4.573673188291683]
We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level.
xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator.
We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
arXiv Detail & Related papers (2025-03-18T23:15:02Z)
- Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 competitive C++ programming submission pairs, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play (a sketch of the retrieval-based prompting idea follows this entry).
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
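As a hedged illustration of the retrieval-based few-shot prompting named in the entry above (not the authors' implementation), the toy sketch below retrieves the stored slow/fast edit pairs most similar to a query program and assembles them into a prompt. The example store, similarity measure, and prompt format are all assumptions.

```python
# Toy retrieval-based few-shot prompting for program optimization; the
# paper's dataset holds ~77k C++ submission pairs, mimicked here by a
# tiny hypothetical store.
from difflib import SequenceMatcher

EDIT_PAIRS = [
    ("for i in range(len(xs)): total += xs[i]", "total = sum(xs)"),
    ("result = []\nfor x in xs: result.append(f(x))",
     "result = [f(x) for x in xs]"),
]

def retrieve(query: str, k: int = 2):
    """Rank stored pairs by textual similarity to the query program."""
    return sorted(
        EDIT_PAIRS,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )[:k]

def build_prompt(slow_program: str) -> str:
    """Assemble a few-shot prompt from the retrieved edit pairs."""
    shots = "\n\n".join(
        f"# Slow:\n{slow}\n# Fast:\n{fast}"
        for slow, fast in retrieve(slow_program)
    )
    return f"{shots}\n\n# Slow:\n{slow_program}\n# Fast:\n"

print(build_prompt("for i in range(len(ys)): acc += ys[i]"))
```

A model prompted this way completes the final "# Fast:" slot with a candidate optimized program.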
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X, which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
Compared to its full-precision software counterpart, it reduces classification time by three orders of magnitude with a small 4.5% impact on accuracy.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization [10.115260534967645]
We propose Moses, a simple and efficient design based on the lottery ticket hypothesis.
Compared with state-of-the-art approaches, Moses achieves up to a 1.53x efficiency gain in the search stage.
arXiv Detail & Related papers (2022-01-15T03:55:52Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
- Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance over state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
arXiv Detail & Related papers (2021-10-20T17:06:09Z)