Related papers: MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

URL: http://arxiv.org/abs/2507.17773v2
Date: Sat, 26 Jul 2025 08:05:03 GMT
Title: MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
Authors: Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang,
Abstract summary: MultiKernelBench is a benchmark for the generation of deep learning kernels using large language models (LLMs). It spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms. We show significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies.
Score: 17.461533973039064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.
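The category-aware one-shot prompting described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Exemplar` type, the category names, the placeholder kernel strings, and the prompt layout are all assumptions made for illustration. The only idea taken from the abstract is that the single exemplar shown to the model is drawn from the same kernel category as the target task.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Exemplar:
    """A solved task used as the one-shot example (hypothetical structure)."""
    category: str
    task: str
    kernel: str


# Hypothetical exemplar pool; categories and kernel text are placeholders.
EXEMPLARS = [
    Exemplar("activation", "Implement ReLU", "// placeholder ReLU kernel source"),
    Exemplar("reduction", "Implement sum over the last dimension",
             "// placeholder reduction kernel source"),
]


def pick_exemplar(category: str, exemplars: Sequence[Exemplar]) -> Optional[Exemplar]:
    """Return the first exemplar in the same category, or None if there is none."""
    for ex in exemplars:
        if ex.category == category:
            return ex
    return None


def build_prompt(task: str, category: str, platform: str,
                 exemplars: Sequence[Exemplar] = EXEMPLARS) -> str:
    """Assemble a one-shot prompt whose example comes from the task's category."""
    parts = [f"Target platform: {platform}"]
    ex = pick_exemplar(category, exemplars)
    if ex is not None:
        # In-category exemplar: shown before the actual task.
        parts.append(f"Example task: {ex.task}")
        parts.append(f"Example kernel:\n{ex.kernel}")
    parts.append(f"Task: {task}")
    parts.append("Write the kernel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Implement GELU", "activation", "CUDA"))
```

When no exemplar exists for a category, this sketch falls back to a zero-shot prompt; how the benchmark itself handles that case is not stated in the abstract.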

Related papers

Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
GPU Performance Portability needs Autotuning [0.0]
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. We make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
arXiv Detail & Related papers (2025-04-30T12:57:21Z)
Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent [32.81416809245337]
Kernel tuning involves systematically adjusting kernel configurations to optimize system performance. Despite recent advancements in large language models (LLMs), kernel tuning remains a critical challenge. We propose BYOS, an LLM-powered framework that automates kernel tuning.
arXiv Detail & Related papers (2025-03-12T15:50:16Z)
KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
arXiv Detail & Related papers (2025-02-14T19:30:53Z)
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators [1.1028525384019312]
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs. Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks.
arXiv Detail & Related papers (2024-10-31T18:34:59Z)
Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective [32.827076621809965]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields. LLMs such as the GPT series and the Llama series are currently the main focus due to their superior algorithmic performance. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance.
arXiv Detail & Related papers (2024-10-06T12:42:04Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs. We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z)
Benchmarking Predictive Coding Networks -- Made Simple [48.652114040426625]
We tackle the problems of efficiency and scalability for predictive coding networks (PCNs) in machine learning. We propose a library, called PCX, that focuses on performance and simplicity, and use it to implement a large set of standard benchmarks. We perform extensive tests on such benchmarks using both existing algorithms for PCNs, as well as adaptations of other methods popular in the bio-plausible deep learning community.
arXiv Detail & Related papers (2024-07-01T10:33:44Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose the kernel development into two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Our multi-scale linear attention achieves a global receptive field and multi-scale learning. EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.