SLA-Awareness for AI-assisted coding
- URL: http://arxiv.org/abs/2503.19876v1
- Date: Tue, 25 Mar 2025 17:38:28 GMT
- Title: SLA-Awareness for AI-assisted coding
- Authors: Kishanthan Thangarajah, Arthur Leung, Boyuan Chen, Ahmed E. Hassan
- Abstract summary: We propose the Coding Assistant Task Orchestrator (CATO) to serve a diverse assortment of coding tasks while meeting latency requirements and maximizing resource utilization. Our experiments demonstrate that when all types of coding tasks are served simultaneously, CATO improves the overall goodput rate and resource utilization of TTFT-critical tasks by up to 10% and 41.1%, respectively.
- Score: 6.199193051670653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of AI-assisted coding tools within development environments drastically reduces development time and allows developers to focus on the more creative and critical aspects of software engineering through the use of Code Large Language Models (CodeLLMs). These coding assistants automate repetitive and time-consuming coding tasks such as code generation, code completion, code summarization, and code translation. Responsiveness is a crucial requirement of these coding assistants to maintain real-time interactivity, so that their use does not impede developers' workflows. Different coding tasks have distinct characteristics and latency requirements: Time-To-First-Token (TTFT) latency is essential for code completion tasks, while End-To-End (E2E) latency is crucial for code translation tasks. Managing these varying requirements simultaneously while optimizing resource usage poses significant challenges. Existing work adopts the Model-as-a-Service paradigm for serving individual CodeLLMs, but cannot effectively manage the latency requirements of concurrent coding tasks and sequences of CodeLLM inference calls, due to a lack of end-to-end latency awareness. A further challenge is keeping resource utilization high when the serving system is deployed in a shared cluster environment. To address these challenges, we propose the Coding Assistant Task Orchestrator (CATO), a runtime system designed to serve a diverse assortment of coding tasks while meeting latency requirements and maximizing resource utilization. Our experiments demonstrate that when all types of coding tasks are served simultaneously, CATO improves the overall goodput rate and resource utilization of TTFT-critical tasks by up to 10% and 41.1%, respectively. P95 E2E latency is also reduced by 18% for code summarization tasks, and P95 TTFT for code generation tasks is reduced by 14% compared with state-of-the-art systems.
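To make the latency taxonomy concrete, here is a minimal sketch of SLA-aware request scheduling: each request carries an absolute deadline derived from its task's latency class, and an earliest-deadline-first queue dispatches TTFT-critical work ahead of long-running E2E jobs. The task names, SLA targets, and deadline heuristic below are illustrative assumptions, not CATO's actual policy.

```python
# A minimal sketch of latency-aware scheduling; not CATO's algorithm.
import heapq
import time
from dataclasses import dataclass, field

# Assumed per-task latency targets (seconds); real SLAs would be configured.
SLA = {
    "code_completion": {"metric": "TTFT", "target": 0.2},
    "code_generation": {"metric": "TTFT", "target": 0.5},
    "code_summarization": {"metric": "E2E", "target": 5.0},
    "code_translation": {"metric": "E2E", "target": 10.0},
}

@dataclass(order=True)
class Request:
    deadline: float                      # absolute deadline; heapq orders by this
    task: str = field(compare=False)
    payload: str = field(compare=False)

class SlaScheduler:
    """Earliest-deadline-first queue over heterogeneous coding tasks."""
    def __init__(self):
        self._queue = []

    def submit(self, task: str, payload: str) -> None:
        deadline = time.monotonic() + SLA[task]["target"]
        heapq.heappush(self._queue, Request(deadline, task, payload))

    def next_request(self) -> Request | None:
        return heapq.heappop(self._queue) if self._queue else None

sched = SlaScheduler()
sched.submit("code_translation", "translate foo.py to Go")
sched.submit("code_completion", "def quicksort(")
req = sched.next_request()
print(req.task)  # the TTFT-critical completion is dispatched first
```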
Related papers
- Context-Aware CodeLLM Eviction for AI-assisted Coding [6.199193051670653]
AI-assisted coding tools powered by Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development. To address concerns around privacy, latency, and model customization, many enterprises opt to self-host these models. This paper presents CACE, a novel context-aware model eviction strategy designed specifically to optimize self-hosted CodeLLM serving under resource constraints.
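A hedged sketch of the general idea behind context-aware eviction: score each resident model by idle time and queued demand, and evict the least useful one when GPU capacity is exceeded. The scoring weights, API, and model names below are assumptions, not CACE's actual policy.

```python
# Illustrative eviction scoring; weights are guesses, not CACE's design.
import time

class EvictionManager:
    def __init__(self, capacity: int):
        self.capacity = capacity               # max models resident on GPU
        self.last_used: dict[str, float] = {}  # model -> last request time
        self.pending: dict[str, int] = {}      # model -> queued requests

    def touch(self, model: str) -> None:
        self.last_used[model] = time.monotonic()

    def score(self, model: str) -> float:
        # Lower score = better eviction candidate: stale and no queued work.
        idle = time.monotonic() - self.last_used.get(model, 0.0)
        demand = self.pending.get(model, 0)
        return demand * 10.0 - idle            # weights are illustrative

    def pick_victim(self) -> str | None:
        if len(self.last_used) < self.capacity:
            return None                        # still room; nothing to evict
        return min(self.last_used, key=self.score)

mgr = EvictionManager(capacity=2)
for m in ("starcoder", "codellama", "deepseek-coder"):
    mgr.touch(m)
print(mgr.pick_victim())   # the stalest model with no queued work
```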
arXiv Detail & Related papers (2025-06-23T16:03:32Z)
- Exploring Prompt Patterns in AI-Assisted Code Generation: Towards Faster and More Effective Developer-AI Collaboration [3.1861081539404137]
This paper explores the application of structured prompt patterns to minimize the number of interactions required for satisfactory AI-assisted code generation. We analyzed seven distinct prompt patterns to evaluate their effectiveness in reducing back-and-forth communication between developers and AI.
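As a flavor of what a structured prompt pattern looks like, here is one hypothetical "requirements-first" template; the seven patterns the paper analyzes are not reproduced here.

```python
# A hypothetical structured prompt pattern, invented for illustration.
def build_prompt(role: str, task: str, constraints: list[str], example: str) -> str:
    lines = [
        f"You are {role}.",
        f"Task: {task}",
        "Constraints:",
        *[f"- {c}" for c in constraints],
        f"Example of expected output:\n{example}",
        "Return only the final code.",
    ]
    return "\n".join(lines)

print(build_prompt(
    role="a senior Python developer",
    task="write a function that deduplicates a list while preserving order",
    constraints=["no external dependencies", "O(n) time"],
    example="dedupe([1, 2, 1]) -> [1, 2]",
))
```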
arXiv Detail & Related papers (2025-06-02T12:43:08Z)
- Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development [65.94639060883475]
We propose a resource-aware multi-agent system, Co-Saving. Our key innovation is the introduction of "shortcuts". Compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage.
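A loose sketch of how a "shortcut" might work, read purely from the abstract: memoize the output of a previously successful phase and skip the agent dialogue on a repeat request. The keying scheme and agent API here are hypothetical.

```python
# Hypothetical shortcut cache; not Co-Saving's actual mechanism.
shortcut_cache: dict[tuple[str, str], str] = {}

def run_phase(phase: str, spec: str, agent) -> str:
    key = (phase, spec)
    if key in shortcut_cache:          # shortcut: skip the agent dialogue
        return shortcut_cache[key]
    result = agent(phase, spec)        # expensive multi-turn LLM exchange
    shortcut_cache[key] = result       # remember the successful trajectory
    return result
```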
arXiv Detail & Related papers (2025-05-28T02:23:53Z)
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU, respectively. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
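For readers new to the draft-then-verify framework, here is a toy greedy speculative-decoding step with stand-in models; DuoDecoding's CPU/GPU placement and dynamic multi-sequence drafting are not modeled.

```python
# Toy greedy speculative decoding; the "models" are stand-in callables.
def speculative_step(prefix: list[int], draft, target, k: int = 4) -> list[int]:
    # 1) Cheap draft model proposes k tokens autoregressively (CPU-friendly).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) Target model checks the proposals (done in parallel in practice;
    #    sequential here for clarity, with a greedy acceptance rule).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3) Every step gains at least one token from the target model.
    accepted.append(target(ctx))
    return prefix + accepted

draft  = lambda ctx: len(ctx) % 7    # toy stand-ins, not real models
target = lambda ctx: len(ctx) % 5
print(speculative_step([1, 2, 3], draft, target))
```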
arXiv Detail & Related papers (2025-03-02T08:27:48Z)
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models. Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges. We enhance the performance of speculative decoding in long-context settings by addressing these challenges.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
- RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression [68.31184784672227]
In modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks. It is therefore useful to optimize the encoder for a downstream task instead of for image quality. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task.
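A heuristic stand-in for the block-level idea (the paper trains an RL agent, which is not reproduced here): map a per-macroblock task-relevance score to a Quantization Parameter, spending bits where the downstream model looks. The relevance map and QP range are assumptions.

```python
# Heuristic QP assignment from task relevance; not the paper's RL policy.
import numpy as np

def relevance_to_qp(relevance: np.ndarray, qp_min=22, qp_max=42) -> np.ndarray:
    """relevance in [0, 1] per macroblock; high relevance -> low QP (more bits)."""
    qp = qp_max - relevance * (qp_max - qp_min)
    return np.rint(qp).astype(int)

relevance = np.random.default_rng(0).random((8, 8))  # e.g. an 8x8 macroblock grid
print(relevance_to_qp(relevance))
```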
arXiv Detail & Related papers (2025-01-21T15:36:08Z)
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location [3.348953136575379]
HyGen is an interference-aware LLM serving system that enables efficient co-location of online and offline workloads. Our evaluation on production workloads shows that HyGen achieves up to 3.87x overall throughput and 5.84x offline throughput gains.
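A simplified rendering of online-offline co-location: latency-sensitive requests claim batch slots first, and offline work backfills the remainder, with reserved headroom as a crude stand-in for HyGen's interference awareness. The batch size and headroom constant are assumptions.

```python
# Simplified co-location batching; constants are illustrative assumptions.
from collections import deque

online, offline = deque(), deque()
MAX_BATCH = 8
LATENCY_HEADROOM = 2   # slots never given to offline work (assumed)

def next_batch():
    batch = []
    while online and len(batch) < MAX_BATCH:
        batch.append(online.popleft())     # latency-sensitive first
    # Backfill with offline work, but keep a few slots free so a burst of
    # online requests is not stuck behind offline decodes next iteration.
    budget = MAX_BATCH - LATENCY_HEADROOM
    while offline and len(batch) < budget:
        batch.append(offline.popleft())
    return batch

online.extend(f"chat-{i}" for i in range(2))
offline.extend(f"batch-{i}" for i in range(10))
print(next_batch())   # 2 online requests plus 4 offline backfill slots
```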
arXiv Detail & Related papers (2025-01-15T16:32:27Z)
- Agent-Driven Automatic Software Improvement [55.2480439325792]
This research proposal aims to explore innovative solutions by focusing on the deployment of agents powered by Large Language Models (LLMs).
The iterative nature of agents, which allows for continuous learning and adaptation, can help surpass common challenges in code generation.
We aim to use the iterative feedback in these systems to further fine-tune the LLMs underlying the agents, becoming better aligned to the task of automated software improvement.
arXiv Detail & Related papers (2024-06-24T15:45:22Z)
- RL-GPT: Integrating Reinforcement Learning and Code-as-policy [82.1804241891039]
We introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent.
The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks.
This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline.
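A bare-bones rendering of the two-level loop with both agents stubbed as plain functions (the real system calls LLMs); the decomposition heuristic and code template are invented for illustration.

```python
# Stubbed two-level slow/fast agent loop; not RL-GPT's implementation.
def slow_agent(goal: str) -> list[str]:
    # Decide which sub-goals are suitable for coding (stubbed heuristic).
    return [s.strip() for s in goal.split(";")]

def fast_agent(subgoal: str) -> str:
    # Emit code-as-policy for one sub-goal (stubbed template).
    return f"def policy():\n    # TODO: implement '{subgoal}'\n    pass"

for sub in slow_agent("collect wood; craft table; craft pickaxe"):
    print(fast_agent(sub))
```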
arXiv Detail & Related papers (2024-02-29T16:07:22Z)
- MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning [28.12788291168137]
We present a multi-task fine-tuning framework, MFTcoder, that enables simultaneous and parallel fine-tuning on multiple tasks.
Experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks.
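A schematic multi-task fine-tuning step in PyTorch: draw one batch per task and average the per-task losses so no single task dominates the gradient. This shows the general idea only; MFTCoder's actual loss balancing is its own design.

```python
# Schematic multi-task step with equal task weighting; an assumption,
# not MFTCoder's loss-balancing scheme.
import torch

def multitask_step(model, optimizer, task_batches, loss_fn):
    optimizer.zero_grad()
    losses = []
    for task, (inputs, targets) in task_batches.items():
        outputs = model(inputs)
        losses.append(loss_fn(outputs, targets))  # one loss per task
    loss = torch.stack(losses).mean()             # equal task weighting
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Linear(4, 2)                     # toy stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
batches = {
    "completion": (torch.randn(8, 4), torch.randint(0, 2, (8,))),
    "translation": (torch.randn(8, 4), torch.randint(0, 2, (8,))),
}
print(multitask_step(model, opt, batches, torch.nn.functional.cross_entropy))
```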
arXiv Detail & Related papers (2023-11-04T02:22:40Z)
- Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable.
Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost.
This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
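A minimal slimmable layer sketch in PyTorch: the same weight matrix can run at a fraction of its output width, which is the mechanism that lets task-specific decoders trade accuracy for compute. The width ratios and budget policy are illustrative assumptions.

```python
# Minimal slimmable linear layer; an illustration, not the paper's code.
import torch

class SlimmableLinear(torch.nn.Linear):
    def forward(self, x: torch.Tensor, width_ratio: float = 1.0) -> torch.Tensor:
        out = max(1, int(self.out_features * width_ratio))
        # Slice the same weights to the requested width (fewer FLOPs).
        return torch.nn.functional.linear(x, self.weight[:out], self.bias[:out])

layer = SlimmableLinear(16, 8)
x = torch.randn(2, 16)
print(layer(x, width_ratio=1.0).shape)   # full decoder: (2, 8)
print(layer(x, width_ratio=0.5).shape)   # slimmed decoder: (2, 4)
```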
arXiv Detail & Related papers (2023-08-22T19:09:56Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization in multi-task NLP.
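One ingredient commonly used in latency-aware NLP inference of this kind is entropy-based early exit: stop at the first layer whose classifier is already confident. A minimal sketch follows, with random stand-in layers; it should not be read as EdgeBERT's full co-design.

```python
# Entropy-based early exit with stand-in layers; an illustration only.
import torch

def entropy(logits: torch.Tensor) -> torch.Tensor:
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum(dim=-1)

layers = [torch.nn.Linear(32, 32) for _ in range(6)]   # stand-in encoder
heads = [torch.nn.Linear(32, 2) for _ in range(6)]     # per-layer exit heads

def early_exit(x: torch.Tensor, threshold: float = 0.3):
    for i, (layer, head) in enumerate(zip(layers, heads)):
        x = torch.relu(layer(x))
        logits = head(x)
        if entropy(logits).item() < threshold:          # confident enough
            return logits, i                            # exit early
    return logits, len(layers) - 1                      # ran all layers

logits, exited_at = early_exit(torch.randn(1, 32))
print(f"exited after layer {exited_at}")
```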
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Delay-aware Resource Allocation in Fog-assisted IoT Networks Through Reinforcement Learning [22.624703832795355]
Fog nodes in the vicinity of IoT devices are promising to provision low latency services by offloading tasks from IoT devices to them.
We investigate the resource allocation problem to minimize the delay of all tasks while their constraints are satisfied.
We design an online reinforcement learning algorithm to make sub-optimal decisions in real time based on the system's experience replay data.
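A toy Q-learning loop with experience replay for a binary local-vs-offload decision; the state space, dynamics, and reward below are invented for illustration and are much simpler than the paper's setting.

```python
# Toy Q-learning with experience replay; environment is invented.
import random
import numpy as np

rng = random.Random(0)
N_STATES, N_ACTIONS = 4, 2   # states: queue-length buckets; actions: 0=local, 1=offload
Q = np.zeros((N_STATES, N_ACTIONS))
replay: list[tuple[int, int, float, int]] = []

def step(state: int, action: int) -> tuple[float, int]:
    # Invented dynamics: local delay grows with the queue, offload cost is flat.
    delay = state + 1 if action == 0 else 2
    return -float(delay), rng.randrange(N_STATES)

state = 0
for _ in range(2000):
    # Epsilon-greedy action, then learn from a sampled replay minibatch.
    action = rng.randrange(N_ACTIONS) if rng.random() < 0.1 else int(Q[state].argmax())
    reward, nxt = step(state, action)
    replay.append((state, action, reward, nxt))
    for s, a, r, s2 in rng.sample(replay, min(16, len(replay))):
        Q[s, a] += 0.1 * (r + 0.9 * Q[s2].max() - Q[s, a])
    state = nxt

print(Q.argmax(axis=1))   # learned policy: offload for the longer-queue states
```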
arXiv Detail & Related papers (2020-04-30T05:07:39Z)