oneDAL Optimization for ARM Scalable Vector Extension: Maximizing Efficiency for High-Performance Data Science
- URL: http://arxiv.org/abs/2504.04241v1
- Date: Sat, 05 Apr 2025 17:53:36 GMT
- Title: oneDAL Optimization for ARM Scalable Vector Extension: Maximizing Efficiency for High-Performance Data Science
- Authors: Chandan Sharma, Rakshith GB, Ajay Kumar Patel, Dhanus M Lal, Darshan Patel, Ragesh Hajela, Masahiro Doteguchi, Priyanka Sharma
- Abstract summary: UXL's oneAPI Data Analytics Library (oneDAL) is widely adopted for accelerating ML and data analytics, but its reliance on Intel's Math Kernel Library (MKL) has traditionally limited its compatibility to x86 platforms. This paper details the porting of oneDAL to ARM architectures with SVE support, using OpenBLAS as an alternative backend to overcome architectural and performance challenges.
- Score: 1.5672115019395867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evolution of ARM-based architectures, particularly those incorporating Scalable Vector Extension (SVE), has introduced transformative opportunities for high-performance computing (HPC) and machine learning (ML) workloads. The Unified Acceleration Foundation's (UXL) oneAPI Data Analytics Library (oneDAL) is a widely adopted library for accelerating ML and data analytics workflows, but its reliance on Intel's proprietary Math Kernel Library (MKL) has traditionally limited its compatibility to x86 platforms. This paper details the porting of oneDAL to ARM architectures with SVE support, using OpenBLAS as an alternative backend to overcome architectural and performance challenges. Beyond porting, the research introduces novel ARM-specific optimizations, including custom sparse matrix routines, vectorized statistical functions, and a Scalable Vector Extension (SVE)-optimized Support Vector Machine (SVM) algorithm. The SVM enhancements leverage SVE's flexible vector lengths and predicate-driven execution, achieving notable performance gains of 22% for the Boser method and 5% for the Thunder method. Benchmarks conducted on ARM SVE-enabled AWS Graviton3 instances showcase up to 200x acceleration in ML training and inference tasks compared to the original scikit-learn implementation on the ARM platform. Moreover, the ARM-optimized oneDAL achieves performance parity with, and in some cases exceeds, the x86 oneDAL implementation (MKL backend) on Ice Lake x86 systems, which are nearly twice as costly as AWS Graviton3 ARM instances. These findings highlight ARM's potential as a high-performance, energy-efficient platform for data-intensive ML applications. By expanding cross-architecture compatibility and contributing to the open-source ecosystem, this work reinforces ARM's position as a competitive alternative in the HPC and ML domains, paving the way for future advancements in data-intensive computing.
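The SVM and statistical-function speedups the abstract attributes to SVE come from two architectural features: the vector length is fixed by the hardware rather than by the instruction encoding, and per-lane predicates remove the need for scalar remainder loops. The paper's kernels are not reproduced here, so the snippet below is only a minimal C++ sketch of that style of code using the standard ACLE intrinsics from <arm_sve.h>; the helper name dot_product_sve and its framing as a building block of an SVM decision function are illustrative assumptions, not oneDAL's actual implementation.

```cpp
#include <arm_sve.h>
#include <cstdint>

// Illustrative, hypothetical helper (not oneDAL's routine): a predicate-driven
// dot product of the kind that appears inside SVM decision functions and
// vectorized statistical moments. svwhilelt builds a predicate that disables
// the lanes past n on the final iteration, so no scalar tail loop is needed.
double dot_product_sve(const double* x, const double* y, int64_t n) {
    svfloat64_t acc = svdup_n_f64(0.0);            // per-lane partial sums
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64_s64(i, n);         // active lanes: i, i+1, ... < n
    while (svptest_any(svptrue_b64(), pg)) {
        svfloat64_t vx = svld1_f64(pg, x + i);     // masked loads, safe at the tail
        svfloat64_t vy = svld1_f64(pg, y + i);
        acc = svmla_f64_m(pg, acc, vx, vy);        // acc += x*y on active lanes only
        i  += svcntd();                            // advance by the hardware vector length
        pg  = svwhilelt_b64_s64(i, n);
    }
    return svaddv_f64(svptrue_b64(), acc);         // horizontal add of the partial sums
}
```

Built with a recent GCC or Clang and -march=armv8-a+sve, the same binary runs unmodified at whatever vector length the hardware provides (256-bit on Graviton3), which is the portability property that distinguishes SVE from fixed-width NEON.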
Related papers
- Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving [12.068287973463786]
Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources.
Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption.
Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two.
arXiv Detail & Related papers (2025-04-17T14:45:03Z)
- MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints [7.287566040274871]
MoE-Lens is an inference system designed through holistic performance modeling for resource-constrained environments.
It captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput.
evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x)
arXiv Detail & Related papers (2025-04-12T21:26:56Z)
- XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units [0.6063137165121326]
State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks.
XAMBA is the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs.
XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations.
arXiv Detail & Related papers (2025-02-10T17:33:30Z)
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend-native, end-to-end production large language model (LLM) serving system.
The core idea is an abstraction mechanism that smooths out workload variability by decomposing computations into fine-grained meta primitives.
For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
arXiv Detail & Related papers (2024-12-24T02:27:44Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes.
Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models [22.055655390093722]
We present an automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs.
Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
arXiv Detail & Related papers (2023-07-07T17:46:08Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with normalized RMSE ranging from 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications.
We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS).
Specifically, we minimize the learning error of all participating users by jointly optimizing transmit power of mobile users, beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware.
We are able to obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
arXiv Detail & Related papers (2020-03-17T02:19:54Z)