Related papers: Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression

URL: http://arxiv.org/abs/2505.01637v1
Date: Sat, 03 May 2025 00:14:31 GMT
Title: Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression
Authors: Samuel J. Kaufman, René Just, Rastislav Bodik
Abstract summary: We introduce a dynamic-programming-based approach to explore more of the search space by decomposing large program specifications into smaller specifications. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $Z_{\geq 0}$ and compresses identical, adjacent solutions.
Score: 5.995843028932167
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs which is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling this space heuristically. In contrast, we introduce a dynamic-programming-based approach to explore more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from each rewrite that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $Z_{\geq 0}$ and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1x2048x16384, bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's gemma.cpp.
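The two ideas in the abstract, memoized decomposition of specifications and compression of identical adjacent table entries, can be illustrated with a toy sketch. This is not Morello's implementation: the specification shape (a 2^i x 2^j problem indexed by the non-negative exponents i and j), the two rewrites ("split_m", "split_n"), the leaf-kernel limit, and the cost constants are all hypothetical, and the compression shown is plain run-length encoding along a single axis.

```python
from functools import lru_cache
from itertools import groupby

# Hypothetical constants for this toy model (not taken from the paper).
LEAF_MAX_EXP = 2    # a leaf kernel handles at most a 4 x 4 problem
LOOP_OVERHEAD = 1   # constant term of the affine cost of a split


@lru_cache(maxsize=None)
def best(i: int, j: int) -> tuple[int, str]:
    """Return the cheapest (cost, decision) for a 2**i x 2**j spec.

    The memo table (here the lru_cache) is keyed by the coordinates
    (i, j), both non-negative integers.
    """
    options = []
    # Rewrite 1: finish with a leaf kernel when the spec is small enough.
    if i <= LEAF_MAX_EXP and j <= LEAF_MAX_EXP:
        options.append((2 ** (i + j), "leaf"))
    # Rewrite 2: halve the first dimension and solve the smaller spec twice.
    if i > 0:
        options.append((2 * best(i - 1, j)[0] + LOOP_OVERHEAD, "split_m"))
    # Rewrite 3: halve the second dimension and solve the smaller spec twice.
    if j > 0:
        options.append((2 * best(i, j - 1)[0] + LOOP_OVERHEAD, "split_n"))
    # Ties are broken by the decision name, which keeps results deterministic.
    return min(options)


def compress_row(j: int, width: int) -> list[tuple[str, int]]:
    """Run-length encode the decisions for specs (0, j) .. (width - 1, j)."""
    decisions = [best(i, j)[1] for i in range(width)]
    return [(d, len(list(run))) for d, run in groupby(decisions)]
```

In this toy model, `compress_row(0, 6)` returns `[("leaf", 3), ("split_m", 3)]`: six neighboring table entries collapse into two runs because adjacent specifications share the same decision.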

Related papers

Quantum Hardware-Efficient Selection of Auxiliary Variables for QUBO Formulations [5.74796205166378]
We present a novel approach for the selection of auxiliary variables tailored for architectures with limited connectivity. We show that, compared to circuits constructed from a QUBO formulation using conventional auxiliary selection methods, the proposed approach reduces the circuit depth by almost 40%.
arXiv Detail & Related papers (2025-11-24T19:00:05Z)
GRACE: Globally-Seeded Representation-Aware Cluster-Specific Evolution for Compiler Auto-Tuning [10.225578019039506]
This paper introduces GRACE, a novel framework for compiler auto-tuning, demonstrated for LLVM IR instruction count optimization. GRACE effectively curtails the search space by leveraging pass synergies and a weighted scoring method to generate initial high-quality candidate sequences and a pass pool. It then employs contrastive learning, using pass sequence-based data augmentation, to create program embeddings that facilitate similarity-aware clustering.
arXiv Detail & Related papers (2025-10-15T06:01:19Z)
Re-Densification Meets Cross-Scale Propagation: Real-Time Neural Compression of LiDAR Point Clouds [83.39320394656855]
LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding.
arXiv Detail & Related papers (2025-08-28T06:36:10Z)
Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis [8.742879659920643]
Hexcute is a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for mixed-type operators. It automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.
arXiv Detail & Related papers (2025-04-22T19:01:28Z)
Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving [12.068287973463786]
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two.
arXiv Detail & Related papers (2025-04-17T14:45:03Z)
Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval. We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up machine numbers. We propose Common Random Reconstruction (CORE), which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z)
HDCC: A Hyperdimensional Computing compiler for classification on embedded systems and high-performance computing [58.720142291102135]
This work introduces the HDCC compiler, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code. HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend. To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
arXiv Detail & Related papers (2023-04-24T19:16:03Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
Memory Safe Computations with XLA Compiler [14.510796427699459]
An XLA compiler extension adjusts the representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device.
arXiv Detail & Related papers (2022-06-28T16:59:28Z)
A Vertex Cut based Framework for Load Balancing and Parallelism Optimization in Multi-core Systems [15.913119724815733]
High-level applications, such as machine learning, are evolving from simple models based on multilayer perceptrons for simple image recognition to much deeper and more complex neural networks for self-driving vehicle control systems. Parallel programs running on high-performance computers often suffer from data communication bottlenecks, limited memory bandwidth, and synchronization overhead due to irregular critical sections. We propose a framework to reduce the data communication and improve the scalability and performance of these applications in multi-core systems.
arXiv Detail & Related papers (2020-10-09T07:54:28Z)
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle. Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.