Related papers: ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

URL: http://arxiv.org/abs/2509.05584v1
Date: Sat, 06 Sep 2025 04:02:04 GMT
Title: ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
Authors: Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari
Abstract summary: Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization.
Score: 7.64805011214817
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74x. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.
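Illustrative sketch: the abstract refers to post-training dynamic quantization, in which weights are typically stored as 8-bit integers ahead of time and activations are quantized on the fly at inference. The snippet below is not ProfilingAgent's implementation. It is a minimal pure-Python sketch of symmetric per-tensor int8 weight quantization, and the function names (`quantize_int8`, `dequantize`, `weight_memory_saving`) are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization so that w is approximately q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def weight_memory_saving(bytes_before: int = 4, bytes_after: int = 1) -> float:
    """Fraction of weight storage saved when moving from float32 to int8."""
    return 1.0 - bytes_after / bytes_before


if __name__ == "__main__":
    w = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(w)
    print(q, scale)
    print(dequantize(q, scale))
    print(weight_memory_saving())
```

Under these assumptions, replacing 4-byte float32 weights with 1-byte int8 codes saves at most 75% of weight storage, which is in the same range as the "up to 74%" figure the abstract reports. How the paper measures memory is not stated here, so the comparison is only indicative.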

Related papers

AgenticPruner: MAC-Constrained Neural Network Compression via LLM-Driven Strategy Search [7.825137277264239]
We propose a framework to achieve Multiply-Accumulate (MAC) operation budgets through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures.
arXiv Detail & Related papers (2026-01-18T06:07:29Z)
Lightweight Transformer Architectures for Edge Devices in Real-Time Applications [0.0]
This survey examines lightweight transformer architectures specifically designed for edge deployment. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x.
arXiv Detail & Related papers (2026-01-05T01:04:25Z)
Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
ZeroLM: Data-Free Transformer Architecture Search for Language Models [54.83882149157548]
Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics. Our evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark.
arXiv Detail & Related papers (2025-03-24T13:11:22Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning [7.262751938473306]
Pruning is a well-established technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. We develop a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
arXiv Detail & Related papers (2024-08-24T05:54:47Z)
Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression [24.119415458653616]
We propose a novel unified pruning framework, Comb, Prune, Distill (CPD), to address both model-agnostic and task-agnostic concerns simultaneously. Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence. In image classification we achieve a speedup of up to 4.3x with an accuracy loss of 1.8%, and in semantic segmentation up to 1.89x with a 5.1% loss in mIoU.
arXiv Detail & Related papers (2024-08-06T09:02:31Z)
Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
Deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization. We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks. To mitigate this challenging problem, structured pruning is a promising solution to compress model size and enable practical efficiency. We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs. Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage. We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution rule that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.