FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
- URL: http://arxiv.org/abs/2601.00227v1
- Date: Thu, 01 Jan 2026 06:18:53 GMT
- Title: FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
- Authors: Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen
- Abstract summary: FlashInfer-Bench is a framework that connects kernel generation, benchmarking, and deployment. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, and a public leaderboard. Using FlashInfer-Bench, we evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
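The abstract names two concrete artifacts, the FlashInfer Trace schema and the apply() substitution hook, without spelling out their shapes. The sketch below is a guess at what a minimal trace record and selection step could look like; every field name and the apply() signature here are assumptions for illustration, not the actual FlashInfer-Bench API.

```python
# Hypothetical sketch of a FlashInfer Trace record and kernel substitution.
# Field names and the apply() signature are assumptions for illustration;
# consult the FlashInfer-Bench repository for the real schema.
from dataclasses import dataclass, field


@dataclass
class KernelTrace:
    """One trace entry tying a kernel definition to its evaluation."""
    definition: str      # operator being implemented, e.g. "paged_attention_decode"
    workload: dict       # shapes/dtypes captured from real serving traces
    implementation: str  # source of the candidate kernel (CUDA, Triton, ...)
    language: str = "triton"
    evaluation: dict = field(default_factory=dict)  # correctness + latency results


def apply(traces: list[KernelTrace], engine: str = "sglang") -> dict[str, KernelTrace]:
    """Stand-in for the dynamic substitution step: pick the fastest *correct*
    kernel per definition; a real implementation would then patch the winners
    into the named serving engine (e.g. SGLang or vLLM)."""
    winners: dict[str, KernelTrace] = {}
    for t in traces:
        if not t.evaluation.get("correct", False):
            continue  # never deploy a kernel that failed correctness checks
        cur = winners.get(t.definition)
        if cur is None or t.evaluation["latency_ms"] < cur.evaluation["latency_ms"]:
            winners[t.definition] = t
    print(f"would install {len(winners)} kernel(s) into {engine}")
    return winners


trace = KernelTrace(
    definition="paged_attention_decode",
    workload={"batch": 8, "heads": 32, "head_dim": 128, "dtype": "fp16"},
    implementation="// candidate kernel source",
    evaluation={"correct": True, "latency_ms": 0.42},
)
apply([trace])
```

Under this reading, the closed loop is: agents emit trace records, the benchmark fills in the evaluation field, and apply() routes the winners into the serving engine.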
Related papers
- Towards Automated Kernel Generation in the Era of LLMs [17.69471168609145]
Kernel engineering is a time-consuming and non-scalable process. Recent advances in large language models (LLMs) and agentic systems have opened new possibilities for automating kernel generation and optimization. The field remains fragmented, however, lacking a systematic perspective on LLM-driven kernel generation.
arXiv Detail & Related papers (2026-01-22T07:53:52Z)
- DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs [18.46752801066992]
We introduce DABench-LLM, a benchmarking framework for evaluating large language models on dataflow-based accelerators. We validate DABench-LLM on three commodity dataflow accelerators: Cerebras WSE-2, SambaNova RDU, and Graphcore IPU.
arXiv Detail & Related papers (2025-12-04T22:43:14Z)
- AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs [0.5863360388454261]
We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform. It uses a software-defined approach to run LLMs across heterogeneous and legacy GPU nodes, and it features a unified client interface that allows seamless interaction with all deployed LLMs.
arXiv Detail & Related papers (2025-11-06T14:19:57Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution, and we observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
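The profile-and-refine workflow this summary describes can be pictured as a simple loop. The sketch below is not STARK's implementation; generate_kernel, profile, and refine are hypothetical stubs standing in for LLM-agent calls and a real benchmarking harness.

```python
# Schematic generate -> profile -> refine loop in the spirit of agentic
# kernel optimization. All three helpers are hypothetical stubs, not STARK's API.
import random


def generate_kernel(spec: str) -> str:
    """Stub: a real system would prompt an LLM agent with the kernel spec."""
    return f"// kernel for {spec}, v0"


def profile(kernel: str) -> float:
    """Stub: a real system would compile and time the kernel on a GPU."""
    return random.uniform(1.0, 2.0)  # pretend latency in milliseconds


def refine(kernel: str, feedback: str) -> str:
    """Stub: a real system would feed profiler output back to the agent."""
    return kernel + "  // revised"


def optimize_kernel(spec: str, budget: int = 8) -> str:
    best = generate_kernel(spec)
    best_latency = profile(best)
    for _ in range(budget):
        candidate = refine(best, f"latency={best_latency:.3f} ms")
        latency = profile(candidate)
        if latency < best_latency:  # keep only measured improvements
            best, best_latency = candidate, latency
    return best


print(optimize_kernel("fused_rmsnorm"))
```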
arXiv Detail & Related papers (2025-10-19T20:41:46Z)
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
- Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling [0.02091806248191979]
We introduce LIFE, a lightweight and modular analytical framework composed of analytical models of individual operators. LIFE characterizes the influence of software and model optimizations such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention mechanisms, and operator fusion. We validate LIFE's forecasts against inference on AMD CPUs, NPUs, iGPUs, and NVIDIA V100 GPUs with Llama2-7B variants.
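Analytical frameworks of this kind typically reduce each operator to a compute-versus-memory roofline. The snippet below illustrates only that general idea; it is not LIFE's actual model, and the V100-class hardware numbers are illustrative placeholders.

```python
# Generic roofline-style latency estimate for one operator, in the spirit of
# hardware-agnostic analytical modeling. NOT LIFE's actual model; the hardware
# parameters are placeholders roughly in the range of a V100-class GPU.

def operator_latency_ms(flops: float, bytes_moved: float,
                        peak_tflops: float, bandwidth_gbs: float) -> float:
    """An operator is bound by whichever resource it exhausts first."""
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (bandwidth_gbs * 1e9) * 1e3
    return max(compute_ms, memory_ms)


# Example: a batch-1 decode-step 4096x4096 projection GEMM in fp16.
flops = 2 * 4096 * 4096          # one multiply-accumulate per weight
bytes_moved = 4096 * 4096 * 2    # fp16 weight traffic dominates at batch 1
print(operator_latency_ms(flops, bytes_moved, peak_tflops=112.0, bandwidth_gbs=900.0))
# Memory time far exceeds compute time here, i.e. the operator is memory-bound,
# which is why optimizations like quantization and KV-cache compression matter.
```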
arXiv Detail & Related papers (2025-07-29T03:08:31Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs, delivering rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents. It covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios, and its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [9.386969461835433]
FlashInfer is a customizable and efficient attention engine for large language models (LLMs). It tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation.
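The block-sparse KV-cache layout mentioned here is easiest to see with a small block-table example. The following is a generic sketch of paged/block-sparse storage with invented sizes, not FlashInfer's internal format.

```python
# Minimal illustration of block-table indexing for a paged/block-sparse KV
# cache: each sequence maps logical token positions to scattered physical
# blocks. A generic sketch, not FlashInfer's internal storage format.
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 128
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)

block_table = [7, 3, 42]  # logical blocks 0..2 live at these physical slots
seq_len = 40              # 2 full blocks + 8 tokens of a partial third block


def gather_keys(pool: np.ndarray, table: list[int], seq_len: int) -> np.ndarray:
    """Reassemble one sequence's keys from scattered physical blocks."""
    blocks = pool[table]                 # (len(table), BLOCK_SIZE, HEAD_DIM)
    flat = blocks.reshape(-1, HEAD_DIM)  # contiguous logical view of the tokens
    return flat[:seq_len]                # trim the partially filled last block


keys = gather_keys(kv_pool, block_table, seq_len)
assert keys.shape == (seq_len, HEAD_DIM)
```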
arXiv Detail & Related papers (2025-01-02T02:02:20Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs. Such a system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)