Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton
Layouts for Arrays
- URL: http://arxiv.org/abs/2309.07002v2
- Date: Mon, 5 Feb 2024 16:30:08 GMT
- Title: Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton
Layouts for Arrays
- Authors: Stephen Nicholas Swatman, Ana-Lucia Varbanescu, Andy D. Pimentel,
Andreas Salzburger, Attila Krasznahorkay
- Abstract summary: We show how the Morton layout can be generalized to a very large family of multi-dimensional data layouts.
We propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts.
We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties.
- Score: 0.3749861135832073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates with kernel running time on real
hardware, and that our evolutionary strategy finds candidates with favorable
simulated cache properties within a small number of generations in four of the
eight real-world applications under consideration. Finally, we demonstrate that
the array layouts found using our evolutionary method perform well not only in
simulated environments but also in real hardware, where they can effect
significant performance gains, up to a factor of ten in extreme cases.
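To make the family of layouts concrete: the generalized Morton layouts studied here can be read as bit-interleaving schemes in which a chromosome fixes, for each bit of the linear offset, which dimension's index supplies its next bit. The minimal sketch below illustrates that reading; it is not the authors' code, and the function name `morton_index` along with the encoding of a chromosome as a list of dimension numbers are assumptions made for this example.

```python
def morton_index(coords, chromosome):
    """Map an N-D coordinate to a linear offset by interleaving the bits
    of the per-dimension indices in the order the chromosome dictates.
    Output bit k is the next-lowest unconsumed bit of dimension
    chromosome[k]."""
    consumed = [0] * len(coords)  # bits already taken from each dimension
    offset = 0
    for out_bit, dim in enumerate(chromosome):
        bit = (coords[dim] >> consumed[dim]) & 1
        offset |= bit << out_bit
        consumed[dim] += 1
    return offset

# For a 4x4 array (2 bits per dimension), three chromosomes recover three
# well-known layouts, illustrating how large the design space is:
#   [1, 0, 1, 0] -> canonical Morton (Z-order): dimension bits alternate
#   [1, 1, 0, 0] -> row-major: offset = 4 * i + j
#   [0, 0, 1, 1] -> column-major: offset = i + 4 * j
for name, chrom in [("morton", [1, 0, 1, 0]),
                    ("row-major", [1, 1, 0, 0]),
                    ("column-major", [0, 0, 1, 1])]:
    print(name, [morton_index((i, j), chrom) for i in range(4) for j in range(4)])
```

Under this encoding, canonical Morton, row-major, and column-major are just three points in a combinatorial space of bit orderings; the evolutionary search described in the abstract mutates and recombines such chromosomes (preserving each dimension's bit count) and scores candidates by cache simulation.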
Related papers
- AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions, yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We evaluate our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions (a minimal sketch of the memory-token idea follows this entry).
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
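As a rough illustration of the memory-token mechanism named in the entry above: a bank of trainable vectors is concatenated with the input sequence so that attention can read from it. The PyTorch sketch below shows only that generic pattern; it is not the paper's architecture, and the class name, shapes, and hyperparameters are invented for the example.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Self-attention where keys/values also include a bank of learnable
    memory tokens (a generic sketch of the idea, not the paper's model)."""

    def __init__(self, dim: int, num_heads: int, num_memory_tokens: int):
        super().__init__()
        # Trainable memory bank, shared across all inputs.
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([x, mem], dim=1)  # sequence augmented with memory
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

layer = MemoryAugmentedAttention(dim=64, num_heads=4, num_memory_tokens=8)
y = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```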
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- Optimizing L1 cache for embedded systems through grammatical evolution [1.9371782627708491]
Grammatical Evolution (GE) is able to efficiently find the best cache configurations for a given set of benchmark applications.
Our proposal finds cache configurations that obtain an average improvement of 62% over a real-world baseline configuration (a simplified search-loop sketch follows this entry).
arXiv Detail & Related papers (2023-03-06T18:10:00Z)
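The entry above uses grammatical evolution, in which an integer genome is decoded into a phenotype by choosing grammar productions modulo the number of options. The sketch below keeps only that genotype-to-phenotype flavor: the option tables, the three-codon genome, and the dummy fitness are all stand-ins, and a real experiment would score each decoded configuration with a cache simulator rather than the placeholder formula.

```python
import random

# Stand-in option tables playing the role of grammar productions.
SIZES = [1024, 2048, 4096, 8192, 16384]  # total cache size (bytes)
LINES = [16, 32, 64, 128]                # line size (bytes)
WAYS  = [1, 2, 4, 8]                     # associativity

def decode(genome):
    """Genotype-to-phenotype mapping: each codon selects a production
    modulo the number of available options, as in grammatical evolution."""
    return (SIZES[genome[0] % len(SIZES)],
            LINES[genome[1] % len(LINES)],
            WAYS[genome[2] % len(WAYS)])

def fitness(config):
    """Placeholder score; a real run would simulate the benchmark's
    memory trace against this cache configuration."""
    size, line, ways = config
    return -(0.001 * size + 0.01 * line + 100.0 / ways)

def evolve(pop_size=20, generations=30, codons=3):
    pop = [[random.randrange(256) for _ in range(codons)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(decode(g)), reverse=True)
        survivors = pop[: pop_size // 2]  # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, codons)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:          # point mutation
                child[random.randrange(codons)] = random.randrange(256)
            children.append(child)
        pop = survivors + children
    return decode(max(pop, key=lambda g: fitness(decode(g))))

print(evolve())  # e.g. (1024, 16, 8) under the placeholder fitness
```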
- Dynamic Ensemble Size Adjustment for Memory Constrained Mondrian Forest [0.0]
In this paper, we show that under memory constraints, increasing the size of a tree-based ensemble classifier can worsen its performance.
We experimentally show the existence of an optimal ensemble size for a memory-bounded Mondrian forest on data streams.
We conclude that our method can achieve up to 95% of the performance of an optimally-sized Mondrian forest for stable datasets.
arXiv Detail & Related papers (2022-10-11T18:05:58Z)
- Combating Mode Collapse in GANs via Manifold Entropy Estimation [70.06639443446545]
Generative Adversarial Networks (GANs) have shown compelling results in various tasks and applications.
We propose a novel training pipeline to address the mode collapse issue of GANs.
arXiv Detail & Related papers (2022-08-25T12:33:31Z)
- DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
Extensive experiments show that our method uses less memory for the same performance while also drastically reducing inference time (a compression sketch follows this entry).
arXiv Detail & Related papers (2022-03-02T15:25:27Z)
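One way to picture the compression idea in the entry above: transform the keys and values with a DCT along the sequence axis and keep only the leading frequency components, so the softmax runs over a handful of coefficients instead of all positions. The NumPy/SciPy sketch below shows only that general shape; it is not DCT-Former's actual operator, and the function name and the `keep` parameter are invented for the example.

```python
import numpy as np
from scipy.fft import dct

def dct_attention(q, k, v, keep):
    """Approximate attention by truncating a DCT of keys and values
    along the sequence axis (JPEG-style lossy compression), reducing
    the softmax from n positions to `keep` frequency components.
    q, k, v: (n, d) arrays."""
    k_c = dct(k, axis=0, norm="ortho")[:keep]  # (keep, d)
    v_c = dct(v, axis=0, norm="ortho")[:keep]  # (keep, d)
    scores = q @ k_c.T / np.sqrt(q.shape[1])   # (n, keep)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v_c                       # (n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
out = dct_attention(q, k, v, keep=32)  # attends over 32 components, not 256
```

Shrinking `keep` trades fidelity for memory and time, which is the JPEG-style trade-off the summary describes.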
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Adaptive Convolutions with Per-pixel Dynamic Filter Atom [24.691793951360914]
We introduce scalable dynamic convolutions with per-pixel adapted filters.
As plug-and-play replacements to convolutional layers, the introduced adaptive convolutions with per-pixel dynamic atoms enable explicit modeling of intra-image variance.
We present experiments showing that the proposed method delivers comparable or even better performance across tasks.
arXiv Detail & Related papers (2021-08-17T22:04:10Z)
- On the Difficulty of Designing Processor Arrays for Deep Neural Networks [0.0]
Camuy is a lightweight model of a weight-stationary systolic array for linear algebra operations.
We present an analysis of popular models to illustrate how it can estimate required cycles, data movement costs, and systolic array utilization.
arXiv Detail & Related papers (2020-06-24T19:24:08Z)
- Spatial Pyramid Based Graph Reasoning for Semantic Segmentation [67.47159595239798]
We apply graph convolution to the semantic segmentation task and propose an improved Laplacian.
The graph reasoning is directly performed in the original feature space organized as a spatial pyramid.
We achieve comparable performance with advantages in computational and memory overhead.
arXiv Detail & Related papers (2020-03-23T12:28:07Z)