ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
- URL: http://arxiv.org/abs/2601.20755v2
- Date: Thu, 29 Jan 2026 10:43:56 GMT
- Title: ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
- Authors: Bohua Zou, Debayan Roy, Dhimankumar Yogesh Airao, Weihao Xu, Binqi Sun, Yutao Liu, Haibo Chen
- Abstract summary: We develop a fine-grained, non-intrusive profiling framework for modern inference engines. Our system attaches probes to runtime functions across multiple layers -- without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends.
- Score: 4.191309912359899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today's LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions -- is this workload memory-bound or compute-bound? -- often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama-cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers -- without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.
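The core idea -- attaching timing probes to existing runtime functions without modifying or recompiling them -- can be illustrated with a pure-Python analogy. The sketch below wraps a method at runtime to record per-call latency, loosely mirroring how an eBPF uprobe/uretprobe pair hooks a compiled function's entry and exit. All names here (`FakeEngine`, `decode_step`) are illustrative stand-ins, not the paper's actual probe points.

```python
import time
from collections import defaultdict

latencies = defaultdict(list)  # symbol name -> list of call durations (seconds)

def attach_probe(cls, sym):
    """Wrap cls.sym with entry/exit timing without editing its source --
    a user-space analogy to attaching an eBPF uprobe/uretprobe pair."""
    original = getattr(cls, sym)
    def probe(*args, **kwargs):
        start = time.perf_counter()          # "uprobe": function entry
        try:
            return original(*args, **kwargs)
        finally:                             # "uretprobe": function exit
            latencies[sym].append(time.perf_counter() - start)
    setattr(cls, sym, probe)

class FakeEngine:
    def decode_step(self):
        time.sleep(0.001)  # stand-in for one operator/graph execution

engine = FakeEngine()
attach_probe(FakeEngine, "decode_step")  # dynamic attach, no recompilation
for _ in range(5):
    engine.decode_step()
print(len(latencies["decode_step"]), "latency samples collected")
```

Real eBPF probes achieve the same effect at the kernel level against compiled binaries, which is why the paper's approach needs neither source changes nor a rebuild of the inference engine.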
Related papers
- AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving [16.664502126572856]
AIConfigurator is a unified performance-modeling system for Large Language Model (LLM) inference. It enables rapid configuration search across serving frameworks without requiring GPU-based profiling. It identifies superior serving configurations that improve performance by up to 40% for dense models.
arXiv Detail & Related papers (2026-01-09T20:03:57Z) - Optimizing Agentic Language Model Inference via Speculative Tool Calls [4.106903307413157]
We introduce novel systems optimizations to address performance bottlenecks during the inference process. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.
arXiv Detail & Related papers (2025-12-17T18:22:44Z) - AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents [27.864519204078004]
Large language models (LLMs) have shown impressive performance in general programming tasks. We introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance. We show that AutoMLGen achieves state-of-the-art performance across numerous dimensions, such as the average medal rate and the valid submission rate.
arXiv Detail & Related papers (2025-10-09T17:45:05Z) - Semantic-Aware Scheduling for GPU Clusters with Large Language Models [60.14838697778884]
We propose SchedMate, a framework that bridges the semantic gap between schedulers and the jobs they manage. SchedMate extracts deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. We show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing scheduling performance.
arXiv Detail & Related papers (2025-10-02T02:01:02Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff, a novel framework for Automatic Prompt Engineering (APE). LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback akin to textual gradients. It consistently outperforms existing textual-gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z) - MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines [28.18421624702502]
We introduce MobiZO, a resource-efficient fine-tuning framework for Large Language Models (LLMs) specifically designed for edge devices. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines [6.381783966294295]
Open-source large language models (LLMs) enable developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance.
We analyze the performance, particularly the throughput (tokens generated per unit of time) of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines.
arXiv Detail & Related papers (2024-08-02T06:56:59Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
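The roofline model reduces to one comparison: a kernel's arithmetic intensity (FLOPs per byte moved) against the machine balance (peak FLOP/s divided by peak memory bandwidth). Below the balance point the kernel is memory-bound -- the question the ProfInfer abstract raises for LLM workloads. A minimal sketch, using purely illustrative hardware and kernel numbers:

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as memory- or compute-bound under the roofline model.

    flops: floating-point operations performed by the kernel
    bytes_moved: bytes transferred to/from memory
    peak_flops: machine peak compute throughput (FLOP/s)
    peak_bw: machine peak memory bandwidth (bytes/s)
    """
    intensity = flops / bytes_moved        # FLOPs per byte
    balance = peak_flops / peak_bw         # machine balance point
    attainable = min(peak_flops, intensity * peak_bw)
    kind = "memory-bound" if intensity < balance else "compute-bound"
    return kind, attainable

# Illustrative numbers only: a GEMV-like decode step (2 FLOPs per fp16
# weight read) on a hypothetical GPU with 300 TFLOP/s and 1 TB/s.
kind, perf = roofline_bound(flops=2 * 4096 * 4096,
                            bytes_moved=2 * 4096 * 4096,
                            peak_flops=300e12, peak_bw=1e12)
print(kind, f"{perf/1e12:.1f} TFLOP/s attainable")
```

With an intensity of 1 FLOP/byte against a balance point of 300, this decode-like kernel sits far into the memory-bound region -- consistent with the common observation that single-batch LLM decoding is bandwidth-limited.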
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models [26.2566707495948]
Large Language Models (LLMs) have seen great advance in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
arXiv Detail & Related papers (2023-11-07T03:25:56Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script-language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all such requirements while still using these basic engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.