AVX / NEON Intrinsic Functions: When Should They Be Used?
- URL: http://arxiv.org/abs/2601.04922v1
- Date: Thu, 08 Jan 2026 13:21:19 GMT
- Title: AVX / NEON Intrinsic Functions: When Should They Be Used?
- Authors: Théo Boivin, Joeffrey Legaux
- Abstract summary: A cross-configuration benchmark is proposed to explore the capacities and limitations of AVX / NEON intrinsic functions. The main aim is to guide developers in choosing when to use intrinsic functions, depending on the OS, architecture and/or available compiler.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A cross-configuration benchmark is proposed to explore the capacities and limitations of AVX / NEON intrinsic functions in the generic context of a development project, when a vectorisation strategy is required to optimise the code. The main aim is to guide developers in choosing when to use intrinsic functions, depending on the OS, architecture and/or available compiler. Intrinsic functions were observed to be highly efficient in conditional branching, with the execution time of the intrinsic version falling to around 5% of that of the plain code. However, intrinsic functions were found to be unnecessary in many cases, since the compilers already auto-vectorise the code well.
Related papers
- Understanding Accelerator Compilers via Performance Profiling [1.1841612917872066]
Accelerator design languages (ADLs) are high-level languages that compile to hardware units. We introduce Petal, a cycle-level tool for understanding how the compiler's decisions affect performance. We show that Petal's cycle-level profiles can identify performance problems in existing designs.
arXiv Detail & Related papers (2025-11-24T22:40:11Z)
- VecIntrinBench: Benchmarking Cross-Architecture Intrinsic Code Migration for RISC-V Vector [8.59222474360646]
Translating intrinsic functions to RISC-V Vector (RVV) intrinsic functions across architectures is currently a mainstream approach. There is currently no benchmark that comprehensively evaluates the intrinsic migration capabilities for the RVV extension. We propose VecIntrinBench, the first intrinsic benchmark encompassing RVV extensions.
arXiv Detail & Related papers (2025-11-24T08:11:10Z)
- Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels [37.00431889602245]
This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels. We implement this technique in an MLIR-based compiler supporting both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
arXiv Detail & Related papers (2025-11-14T14:32:28Z)
- IntrinTrans: LLM-based Intrinsic Code Translator for RISC-V Vector [9.678932711610244]
Translating existing vectorized intrinsic code onto RVV intrinsics is a practical and effective approach. Current cross-architecture translation largely relies on manual rewriting, which is time-consuming and error-prone. We present IntrinTrans, a multi-agent approach that utilizes compile-and-test feedback to translate intrinsic code across architectures automatically.
arXiv Detail & Related papers (2025-10-11T08:52:01Z)
- A Walsh Hadamard Derived Linear Vector Symbolic Architecture [83.27945465029167]
Vector Symbolic Architectures (VSAs) are an approach to developing Neuro-symbolic AI.
HLB is designed to have favorable computational efficiency, and efficacy in classic VSA tasks.
arXiv Detail & Related papers (2024-10-30T03:42:59Z)
- Breaking Bad: How Compilers Break Constant-Time Implementations [8.771587132463535]
We investigate how compilers break protections introduced by defensive programming techniques. We run a large-scale experiment to see if such compiler-induced issues manifest in state-of-the-art cryptographic libraries. Our study reveals that several compiler-induced secret-dependent operations occur within some of the most highly regarded cryptographic libraries.
arXiv Detail & Related papers (2024-10-17T12:34:02Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- QParallel: Explicit Parallelism for Programming Quantum Computers [62.10004571940546]
We present a language extension for parallel quantum programming.
QParallel removes ambiguities concerning parallelism in current quantum programming languages.
We introduce a tool that guides programmers in the placement of parallel regions by identifying the subroutines that profit most from parallelization.
arXiv Detail & Related papers (2022-10-07T16:35:16Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware.
We are able to obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
arXiv Detail & Related papers (2020-03-17T02:19:54Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)