Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
- URL: http://arxiv.org/abs/2601.20408v1
- Date: Wed, 28 Jan 2026 09:13:17 GMT
- Title: Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
- Authors: Nicholas Santavas, Kareem Eissa, Patrycja Cieplicka, Piotr Florek, Matteo Nulli, Stefan Vasilev, Seyyed Hadi Hashemi, Antonios Gasteratos, Shahram Khadivi,
- Abstract summary: We present OptiKIT, a distributed LLM optimization framework.<n>It democratizes model compression and tuning by automating complex optimization for non-expert teams.<n>In production, it delivers more than 2x GPU throughput improvement.
- Score: 7.499397261133133
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OptiKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OptiKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.
Related papers
- LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach [10.44190976207354]
LEAN-LLM-OPT is a workflow framework for large-scale OPTimization auto-formulation.<n>It decomposes the modeling task into structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools.<n>It achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches.
arXiv Detail & Related papers (2026-01-14T17:09:57Z) - Rethinking LLM-Driven Heuristic Design: Generating Efficient and Specialized Solvers via Dynamics-Aware Optimization [21.449921296295884]
We propose Dynamics-Aware Heuristics (DASH), a framework that co-optimizes solver search mechanisms and runtime schedules guided by a convergence-aware metric.<n>DASH improves runtime efficiency by over 3 times, while surpassing the solution quality of state-of-the-art baselines across diverse problem scales.
arXiv Detail & Related papers (2026-01-14T05:06:42Z) - xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.<n>xLLM builds a novel decoupled service-engine architecture.<n>xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z) - LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design.<n>We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework.<n>LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z) - LADs: Leveraging LLMs for AI-Driven DevOps [3.240228178267042]
LADs is a principled approach to configuration optimization through in-depth analysis of what optimization works under which conditions.<n>By leveraging Retrieval-Augmented Generation, Few-Shot Learning, Chain-of-Thought, and Feedback-Based Prompt Chaining, LADs generates accurate configurations and learns from deployment failures to iteratively refine system settings.<n>Our findings reveal key insights into the trade-offs between performance, cost, and scalability, helping practitioners determine the right strategies for different deployment scenarios.
arXiv Detail & Related papers (2025-02-28T08:12:08Z) - IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts [28.9807389592324]
Large language model (LLM) agents have emerged as a promising solution to automate the workflow of machine learning.<n>We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models.<n>By systematically updating individual components based on real training feedback, Iterative Refinement improves overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z) - DNN-Powered MLOps Pipeline Optimization for Large Language Models: A Framework for Automated Deployment and Resource Management [0.0]
This research presents a novel framework that leverages Deep Neural Networks (DNNs) to optimize MLOps pipelines for Large Language Models (LLMs)<n>Our approach introduces an intelligent system that automates deployment decisions, resource allocation, and pipeline optimization while maintaining optimal performance and cost efficiency.
arXiv Detail & Related papers (2025-01-14T14:15:32Z) - Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models [54.381650481255235]
We introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (O)
Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions.
Empirical evaluations on eight recent LLMs, both open and closed-sourced, demonstrate that DRPO significantly enhances alignment performance.
arXiv Detail & Related papers (2024-11-13T16:15:38Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Large Language Model as a Catalyst: A Paradigm Shift in Base Station Siting Optimization [62.16747639440893]
Large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering.<n>Our proposed framework incorporates retrieval-augmented generation (RAG) to enhance the system's ability to acquire domain-specific knowledge and generate solutions.
arXiv Detail & Related papers (2024-08-07T08:43:32Z) - A Problem-Oriented Perspective and Anchor Verification for Code Optimization [43.28045750932116]
Large language models (LLMs) have shown remarkable capabilities in solving various programming tasks.<n>This paper investigates the capabilities of LLMs in optimizing code for minimal execution time.
arXiv Detail & Related papers (2024-06-17T16:10:10Z) - ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling [15.67321902882617]
We propose a viable path for training open-source LLMs capable of optimization modeling and developing solver codes.<n>This work also introduces IndustryOR, the first industrial benchmark for evaluating LLMs in solving practical OR problems.
arXiv Detail & Related papers (2024-05-28T01:55:35Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.