Patchwork: A Unified Framework for RAG Serving
- URL: http://arxiv.org/abs/2505.07833v1
- Date: Thu, 01 May 2025 18:58:26 GMT
- Title: Patchwork: A Unified Framework for RAG Serving
- Authors: Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella,
- Abstract summary: Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources.<n>We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks.
- Score: 6.430565435912026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources. However, efficient deployment of these systems presents significant technical challenges due to their inherently heterogeneous computational pipelines comprising LLMs, databases, and specialized processing components. We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks. Patchwork's architecture offers three key innovations: First, it provides a flexible specification interface enabling users to implement custom RAG pipelines. Secondly, it deploys these pipelines as distributed inference systems while optimizing for the unique scalability characteristics of individual RAG components. Third, Patchwork incorporates an online scheduling mechanism that continuously monitors request load and execution progress, dynamically minimizing SLO violations through strategic request prioritization and resource auto-scaling. Our experimental evaluation across four distinct RAG implementations demonstrates that Patchwork delivers substantial performance improvements over commercial alternatives, achieving throughput gains exceeding 48% while simultaneously reducing SLO violations by ~24%.
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition.<n>It shifts away from linear patching by generating multiple diverse implementation designs.<n>Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - Relatron: Automating Relational Machine Learning over Relational Databases [50.94254514286021]
We present a study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks.<n>Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and accuracy is an unreliable guide for choice architecture.
arXiv Detail & Related papers (2026-02-26T02:45:22Z) - Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems [22.603468261037975]
Real-time lane detection in embedded systems faces significant challenges due to subtle and sparse visual signals in RGB images.<n>We propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications.<n>CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity.
arXiv Detail & Related papers (2026-01-05T00:06:06Z) - SCoTER: Structured Chain-of-Thought Transfer for Enhanced Recommendation [24.019381388104236]
We propose SCoTER, a unified framework that treats pattern discovery and structure-aware transfer as a jointly optimized problem.<n>Specifically, SCoTER operationalizes this through two synergistic components: a GVM pipeline for automated pattern discovery and a structure-preserving integration architecture that transfers stepwise logic to efficient models.
arXiv Detail & Related papers (2025-11-24T03:00:04Z) - Experience-Guided Adaptation of Inference-Time Reasoning Strategies [49.954515048847874]
Experience-Guided Reasoner (EGuR) generates tailored strategies at inference time based on accumulated experience.<n>EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x.
arXiv Detail & Related papers (2025-11-14T17:45:28Z) - Retrieval-Augmented Multi-LLM Ensemble for Industrial Part Specification Extraction [0.0]
This paper introduces a retrieval-augmented multi-LLM ensemble framework that orchestrates nine state-of-the-art Large Language Models (LLMs)<n>RAGsemble addresses key limitations of single-model systems by combining the complementary strengths of model families including Gemini (2.0, 2.5, 1.5), OpenAI (GPT-4o, o4-mini), Mistral Large, and Gemma (1B, 4B, 3n-e4b), while grounding outputs in factual data using FAISS-based semantic retrieval.
arXiv Detail & Related papers (2025-11-08T14:43:20Z) - VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use [78.29315418819074]
We introduce VerlTool, a unified and modular framework that addresses limitations through systematic design principles.<n>Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms.<n>The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions.
arXiv Detail & Related papers (2025-09-01T01:45:18Z) - Learning Unified User Quantized Tokenizers for User Representation [33.38662746945411]
U2QT (Unified User Quantized Tokenizers) is a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains.<n>Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence.
arXiv Detail & Related papers (2025-08-01T08:35:32Z) - NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models [72.58372335140241]
Adversarial Prompt Tuning (AdvPT) introduced learnable text prompts to enhance adversarial robustness in Vision-Language Models (VLMs)<n>We present the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning)<n>Our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures.
arXiv Detail & Related papers (2025-06-15T03:34:23Z) - CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection [3.002784388635573]
CF-DETR is an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**.<n>It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline, and an optional fine subtask for enhanced overall accuracy (R2)<n>Under an NPFP** policy, CF-DETR successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.
arXiv Detail & Related papers (2025-05-29T10:23:37Z) - Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation [5.326302374594885]
Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios.<n>We propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors.
arXiv Detail & Related papers (2025-04-20T04:12:38Z) - Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt- parsing module that bridges text understanding and layout generation.<n>MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones.<n>The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z) - LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding [4.759109475818876]
Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains.<n>We introduce LIFT, a novel, high-performance framework that captures multiscale information through meta-learning.<n>We also introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings.
arXiv Detail & Related papers (2025-03-19T17:00:58Z) - RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving [9.962031642362813]
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving.<n>RAG is a structured abstraction that captures the wide range of RAG algorithms.<n> RAGO is a system optimization framework for efficient RAG serving.
arXiv Detail & Related papers (2025-03-18T18:58:13Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving [19.388562622309838]
Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text.<n>We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models.<n>We propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling.
arXiv Detail & Related papers (2025-02-02T22:10:40Z) - A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
A popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution feature as guidance.<n>We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.<n>We develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
arXiv Detail & Related papers (2024-07-02T14:12:21Z) - A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables.<n>Despite the prosperity of lightweight embedding-based RSs, a wide diversity is seen in evaluation protocols.<n>This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.