Related papers: T-TAMER: Provably Taming Trade-offs in ML Serving

T-TAMER: Provably Taming Trade-offs in ML Serving

URL: http://arxiv.org/abs/2509.22992v1
Date: Fri, 26 Sep 2025 23:08:03 GMT
Title: T-TAMER: Provably Taming Trade-offs in ML Serving
Authors: Yuanyuan Yang, Ruimin Zhang, Jamie Morgenstern, Haifeng Xu,
Abstract summary: We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process.<n>Our main result shows that recall is both necessary and sufficient for achieving provable performance guarantees.<n>The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs.
Score: 32.526955555483354
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy-latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.

Related papers

Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior.<n>This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias.<n> Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z)
CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks.<n>We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity.<n>We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z)
Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning.<n>We propose RECAP-a replay strategy with dynamic objective reweighting for general knowledge.<n>Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z)
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs [53.77646961962239]
Supervised Fine-Tuning (SFT) is a common approach to adapt Large Language Models (LLMs) to specialized tasks.<n>We show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation.
arXiv Detail & Related papers (2025-09-25T05:28:22Z)
End-to-End Large Portfolio Optimization for Variance Minimization with Neural Networks through Covariance Cleaning [0.0]
We develop a rotation-invariant neural network that provides the global minimum-variance portfolio.<n>This explicit mathematical mapping offers clear interpretability of each module's role.<n>A single model can be calibrated on panels of a few hundred stocks and applied, without retraining, to one thousand US equities.
arXiv Detail & Related papers (2025-07-02T17:27:29Z)
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness.<n>We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
LoRanPAC: Low-rank Random Features and Pre-trained Models for Bridging Theory and Practice in Continual Learning [103.45785408116146]
Continual learning (CL) aims to train a model that can solve multiple tasks presented sequentially.<n>Recent CL approaches have achieved strong performance by leveraging large pre-trained models that generalize well to downstream tasks.<n>However, such methods lack theoretical guarantees, making them prone to unexpected failures.<n>We aim to bridge this gap by designing a simple CL method that is theoretically sound and highly performant.
arXiv Detail & Related papers (2024-10-01T12:58:37Z)
On the Generalization of Preference Learning with DPO [17.420727709895736]
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences.<n> Preference learning trains models to distinguish between preferred and non-preferred responses based on human feedback.<n>This paper introduces a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization (DPO)
arXiv Detail & Related papers (2024-08-06T22:11:00Z)
Integrating Fairness and Model Pruning Through Bi-level Optimization [16.213634992886384]
We introduce a novel concept of fair model pruning, which involves developing a sparse model that adheres to fairness criteria.<n>In particular, we propose a framework to jointly optimize the pruning mask and weight update processes with fairness constraints.<n>This framework is engineered to compress models that maintain performance while ensuring fairness in a unified process.
arXiv Detail & Related papers (2023-12-15T20:08:53Z)
Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows. We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the textitPR-divergences. Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z)
Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning. These measures should account for the wide variety of models used in practice. The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)
Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method. We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.