Related papers: Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis

URL: http://arxiv.org/abs/2504.16214v2
Date: Wed, 30 Apr 2025 17:29:28 GMT
Title: Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
Authors: Xiao Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko
Abstract summary: Hexcute is a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for mixed-type operators. It automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.
Score: 8.742879659920643
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming efforts. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program, and to reduce programming efforts, it automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to 2.91$\times$ speedup in the end-to-end evaluation.
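To make "matrix multiplication with mixed input data types" concrete, here is a minimal reference sketch in plain Python. It is not Hexcute code and does not reflect its API. It assumes, purely for illustration, floating-point activations and unsigned 4-bit weights packed two per byte (low nibble first), with one scale and one zero point per output column. The helper names `pack_int4`, `unpack_int4`, and `mixed_matmul` are hypothetical.

```python
# Illustrative reference only: an unoptimized mixed-type matmul.
# Assumptions (not from the paper): 4-bit unsigned weights, two per byte,
# low nibble first; per-column float scale and integer zero point.

def pack_int4(values):
    """Pack a flat list of integers in [0, 15] into bytes, low nibble first."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out[:count]


def mixed_matmul(a, packed_w, k, n, scales, zero_points):
    """Compute C = A @ dequant(W).

    a:        M x K list of floats (activations)
    packed_w: K*N 4-bit weights in row-major order, packed by pack_int4
    """
    w = unpack_int4(packed_w, k * n)
    c = [[0.0] * n for _ in range(len(a))]
    for i, row in enumerate(a):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # Dequantize on the fly, then multiply-accumulate in float.
                acc += row[p] * (w[p * n + j] - zero_points[j]) * scales[j]
            c[i][j] = acc
    return c


a = [[1.0, 2.0]]                      # 1 x 2 activations
packed = pack_int4([3, 8, 5, 15])     # 2 x 2 weights, row-major
result = mixed_matmul(a, packed, k=2, n=2,
                      scales=[0.5, 0.25], zero_points=[1, 8])
```

The unpacking and dequantization steps in the inner loop are the extra work that a mixed-type operator has to carry out alongside the multiply-accumulate.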

Related papers

Dato: A Task-Based Programming Model for Dataflow Accelerators [13.87015257740592]
We present Dato, a Python-embedded, task-based programming model for dataflow accelerators. Dato elevates data communication and sharding to first-class type constructs. Dato achieves high performance while significantly reducing the burden of writing optimized code.
arXiv Detail & Related papers (2025-09-08T15:22:51Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [25.521351239401287]
Heterogeneous deep learning systems (DLS) have been widely deployed in industrial data centers. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating programs across DLS. As a result, the programming of DLS is improved by up to 9x via transcompiling legacy programs.
arXiv Detail & Related papers (2025-05-04T15:14:27Z)
GPU accelerated program synthesis: Enumerate semantics, not syntax! [1.3422713954544112]
We build a synthesiser running on GPUs which takes as input positive and negative example traces and returns a logical formula accepting the positive and rejecting the negative traces. With GPU-friendly programming techniques, our synthesiser scales to significantly larger synthesis problems, and operates much faster than the previous CPU-based state-of-the-art.
arXiv Detail & Related papers (2025-04-26T15:06:37Z)
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation [10.605380159381776]
We introduce Tilus, a domain-specific language for General-Purpose GPU computing. It supports low-precision data types with arbitrary bit widths from 1 to 8. Our experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types.
arXiv Detail & Related papers (2025-04-17T14:45:03Z)
CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines. We propose Common Random Reconstruction (CORE), which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z)
Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures [24.841128441671234]
RGNNs are graph neural networks with dedicated structures for modeling the different types of nodes and edges in heterogeneous graphs. We propose Hector, a novel two-level intermediate representation and its code generator framework, to capture the key properties of RGNN models. Hector achieves up to 9.9x speed-up in inference and 43.7x speed-up in training compared with the state-of-the-art public systems.
arXiv Detail & Related papers (2023-01-16T06:53:18Z)
NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences. The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation. In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)
Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering. With the proposed paradigm, we implement a deep learning compiler - Hidet.
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads [86.62083829086393]
This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of Deep Learning workloads with high productivity. TPPs define a compact, yet versatile set of 2D-tensor operators (or a virtual ISA), which can be utilized as building blocks to construct complex operators on high-dimensional tensors. We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
arXiv Detail & Related papers (2021-04-12T18:35:49Z)
Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs. A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler. The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives. We develop novel data reuse analysis algorithms using the polyhedral model. We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle. Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
ProGraML: Graph-based Deep Learning for Program Optimization and Analysis [16.520971531754018]
We introduce ProGraML, a graph-based program representation for machine learning. ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.
arXiv Detail & Related papers (2020-03-23T20:27:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.