Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton
Layouts for Arrays
- URL: http://arxiv.org/abs/2309.07002v2
- Date: Mon, 5 Feb 2024 16:30:08 GMT
- Title: Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton
Layouts for Arrays
- Authors: Stephen Nicholas Swatman, Ana-Lucia Varbanescu, Andy D. Pimentel,
Andreas Salzburger, Attila Krasznahorkay
- Abstract summary: We show how the Morton layout can be generalized to a very large family of multi-dimensional data layouts.
We propose a chromosomal representation for such layouts as well as a methodology for estimating the fitness of array layouts.
We show that our fitness function correlates with kernel running time on real hardware, and that our evolutionary strategy allows us to find candidates with favorable simulated cache properties.
- Score: 0.3749861135832073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The layout of multi-dimensional data can have a significant impact on the
efficacy of hardware caches and, by extension, the performance of applications.
Common multi-dimensional layouts include the canonical row-major and
column-major layouts as well as the Morton curve layout. In this paper, we
describe how the Morton layout can be generalized to a very large family of
multi-dimensional data layouts with widely varying performance characteristics.
We posit that this design space can be efficiently explored using a
combinatorial evolutionary methodology based on genetic algorithms. To this
end, we propose a chromosomal representation for such layouts as well as a
methodology for estimating the fitness of array layouts using cache simulation.
We show that our fitness function correlates with kernel running time on real
hardware, and that our evolutionary strategy finds candidates with favorable
simulated cache properties within a small number of generations in four of the
eight real-world applications under consideration. Finally, we demonstrate that
the array layouts found using our evolutionary method perform well not only in
simulated environments but also in real hardware, where they can effect
significant performance gains, up to a factor of ten in extreme cases.
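To make the family of layouts concrete: the generalized Morton layouts studied here can be read as bit-interleaving schemes in which a chromosome fixes, for each bit of the linear offset, which dimension's index supplies its next bit. The minimal sketch below illustrates that reading; it is not the authors' code, and the function name `morton_index` along with the encoding of a chromosome as a list of dimension numbers are assumptions made for this example.

```python
def morton_index(coords, chromosome):
    """Map an N-D coordinate to a linear offset by interleaving the bits
    of the per-dimension indices in the order the chromosome dictates.
    Output bit k is the next-lowest unconsumed bit of dimension
    chromosome[k]."""
    consumed = [0] * len(coords)  # bits already taken from each dimension
    offset = 0
    for out_bit, dim in enumerate(chromosome):
        bit = (coords[dim] >> consumed[dim]) & 1
        offset |= bit << out_bit
        consumed[dim] += 1
    return offset

# For a 4x4 array (2 bits per dimension), three chromosomes recover three
# well-known layouts, illustrating how large the design space is:
#   [1, 0, 1, 0] -> canonical Morton (Z-order): dimension bits alternate
#   [1, 1, 0, 0] -> row-major: offset = 4 * i + j
#   [0, 0, 1, 1] -> column-major: offset = i + 4 * j
for name, chrom in [("morton", [1, 0, 1, 0]),
                    ("row-major", [1, 1, 0, 0]),
                    ("column-major", [0, 0, 1, 1])]:
    print(name, [morton_index((i, j), chrom) for i in range(4) for j in range(4)])
```

Under this encoding, canonical Morton, row-major, and column-major are just three points in a combinatorial space of bit orderings; the evolutionary search described in the abstract mutates and recombines such chromosomes (preserving each dimension's bit count) and scores candidates by cache simulation.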
Related papers
- AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions, yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We evaluate our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions (a minimal sketch of the memory-token idea follows this entry).
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
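As a rough illustration of the memory-token mechanism named in the entry above: a bank of trainable vectors is concatenated with the input sequence so that attention can read from it. The PyTorch sketch below shows only that generic pattern; it is not the paper's architecture, and the class name, shapes, and hyperparameters are invented for the example.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Self-attention where keys/values also include a bank of learnable
    memory tokens (a generic sketch of the idea, not the paper's model)."""

    def __init__(self, dim: int, num_heads: int, num_memory_tokens: int):
        super().__init__()
        # Trainable memory bank, shared across all inputs.
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([x, mem], dim=1)  # sequence augmented with memory
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

layer = MemoryAugmentedAttention(dim=64, num_heads=4, num_memory_tokens=8)
y = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```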
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- Optimizing L1 cache for embedded systems through grammatical evolution [1.9371782627708491]
Grammatical Evolution (GE) is able to efficiently find the best cache configurations for a given set of benchmark applications.
Our proposal finds cache configurations that obtain an average improvement of 62% over a real-world baseline configuration (a simplified search-loop sketch follows this entry).
arXiv Detail & Related papers (2023-03-06T18:10:00Z)
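The entry above uses grammatical evolution, in which an integer genome is decoded into a phenotype by choosing grammar productions modulo the number of options. The sketch below keeps only that genotype-to-phenotype flavor: the option tables, the three-codon genome, and the dummy fitness are all stand-ins, and a real experiment would score each decoded configuration with a cache simulator rather than the placeholder formula.

```python
import random

# Stand-in option tables playing the role of grammar productions.
SIZES = [1024, 2048, 4096, 8192, 16384]  # total cache size (bytes)
LINES = [16, 32, 64, 128]                # line size (bytes)
WAYS  = [1, 2, 4, 8]                     # associativity

def decode(genome):
    """Genotype-to-phenotype mapping: each codon selects a production
    modulo the number of available options, as in grammatical evolution."""
    return (SIZES[genome[0] % len(SIZES)],
            LINES[genome[1] % len(LINES)],
            WAYS[genome[2] % len(WAYS)])

def fitness(config):
    """Placeholder score; a real run would simulate the benchmark's
    memory trace against this cache configuration."""
    size, line, ways = config
    return -(0.001 * size + 0.01 * line + 100.0 / ways)

def evolve(pop_size=20, generations=30, codons=3):
    pop = [[random.randrange(256) for _ in range(codons)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(decode(g)), reverse=True)
        survivors = pop[: pop_size // 2]  # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, codons)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:          # point mutation
                child[random.randrange(codons)] = random.randrange(256)
            children.append(child)
        pop = survivors + children
    return decode(max(pop, key=lambda g: fitness(decode(g))))

print(evolve())  # e.g. (1024, 16, 8) under the placeholder fitness
```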
- Dynamic Ensemble Size Adjustment for Memory Constrained Mondrian Forest [0.0]
In this paper, we show that under memory constraints, increasing the size of a tree-based ensemble classifier can worsen its performance.
We experimentally show the existence of an optimal ensemble size for a memory-bounded Mondrian forest on data streams.
We conclude that our method can achieve up to 95% of the performance of an optimally-sized Mondrian forest for stable datasets.
arXiv Detail & Related papers (2022-10-11T18:05:58Z)
- Combating Mode Collapse in GANs via Manifold Entropy Estimation [70.06639443446545]
Generative Adversarial Networks (GANs) have shown compelling results in various tasks and applications.
We propose a novel training pipeline to address the mode collapse issue of GANs.
arXiv Detail & Related papers (2022-08-25T12:33:31Z)
- DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
Extensive experiments show that our method uses less memory for the same performance while also drastically reducing inference time (a compression sketch follows this entry).
arXiv Detail & Related papers (2022-03-02T15:25:27Z)
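One way to picture the compression idea in the entry above: transform the keys and values with a DCT along the sequence axis and keep only the leading frequency components, so the softmax runs over a handful of coefficients instead of all positions. The NumPy/SciPy sketch below shows only that general shape; it is not DCT-Former's actual operator, and the function name and the `keep` parameter are invented for the example.

```python
import numpy as np
from scipy.fft import dct

def dct_attention(q, k, v, keep):
    """Approximate attention by truncating a DCT of keys and values
    along the sequence axis (JPEG-style lossy compression), reducing
    the softmax from n positions to `keep` frequency components.
    q, k, v: (n, d) arrays."""
    k_c = dct(k, axis=0, norm="ortho")[:keep]  # (keep, d)
    v_c = dct(v, axis=0, norm="ortho")[:keep]  # (keep, d)
    scores = q @ k_c.T / np.sqrt(q.shape[1])   # (n, keep)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v_c                       # (n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
out = dct_attention(q, k, v, keep=32)  # attends over 32 components, not 256
```

Shrinking `keep` trades fidelity for memory and time, which is the JPEG-style trade-off the summary describes.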
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Adaptive Convolutions with Per-pixel Dynamic Filter Atom [24.691793951360914]
We introduce scalable dynamic convolutions with per-pixel adapted filters.
As plug-and-play replacements to convolutional layers, the introduced adaptive convolutions with per-pixel dynamic atoms enable explicit modeling of intra-image variance.
We present experiments showing that the proposed method delivers comparable or even better performance across tasks.
arXiv Detail & Related papers (2021-08-17T22:04:10Z)
- On the Difficulty of Designing Processor Arrays for Deep Neural Networks [0.0]
Camuy is a lightweight model of a weight-stationary systolic array for linear algebra operations.
We present an analysis of popular models to illustrate how it can estimate required cycles, data movement costs, and systolic array utilization.
arXiv Detail & Related papers (2020-06-24T19:24:08Z)
- Spatial Pyramid Based Graph Reasoning for Semantic Segmentation [67.47159595239798]
We apply graph convolution to the semantic segmentation task and propose an improved Laplacian.
The graph reasoning is directly performed in the original feature space organized as a spatial pyramid.
We achieve comparable performance with advantages in computational and memory overhead.
arXiv Detail & Related papers (2020-03-23T12:28:07Z)