Compressed Models are NOT Trust-equivalent to Their Large Counterparts
- URL: http://arxiv.org/abs/2508.13533v1
- Date: Tue, 19 Aug 2025 05:49:39 GMT
- Title: Compressed Models are NOT Trust-equivalent to Their Large Counterparts
- Authors: Rohit Raj Rai, Chirag Kothari, Siddhesh Shelke, Amit Awekar
- Abstract summary: Large Deep Learning models are often compressed before being deployed in a resource-constrained environment.
Can we trust the predictions of compressed models just as we trust the predictions of the original large model?
We propose a two-dimensional framework for trust-equivalence evaluation.
- Score: 0.8124699127636158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the predictions of compressed models just as we trust the predictions of the original large model? Existing work has closely studied the effect of compression on accuracy and related performance measures, but performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features; we measure it using LIME and SHAP tests. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities; it is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments with BERT-base as the large model and several of its compressed variants, focusing on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and a significant mismatch in calibration similarity, even when the models' accuracies are nearly identical. These findings show that compressed models are not trust-equivalent to their large counterparts: deploying them as drop-in replacements for large models requires careful assessment that goes beyond performance parity.
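As a concrete illustration of the two dimensions, the sketch below computes a calibration score (ECE) and a top-k feature-attribution overlap between two models. This is a minimal sketch, not the paper's exact protocol: the function names, the bin count, and the Jaccard-overlap formulation are illustrative assumptions, and the attribution vectors are assumed to come from LIME or SHAP runs performed elsewhere.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: per-bin |accuracy - mean confidence|, weighted by bin size."""
    conf = probs.max(axis=1)                       # predicted-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def topk_overlap(attr_a, attr_b, k=5):
    """Jaccard overlap of the k most important input features of two models
    (attr_a, attr_b: per-feature LIME or SHAP scores for the same input)."""
    top_a = set(np.argsort(-np.abs(attr_a))[:k])
    top_b = set(np.argsort(-np.abs(attr_b))[:k])
    return len(top_a & top_b) / len(top_a | top_b)
```

A large gap between the two models' ECE values, or a low mean top-k overlap across test inputs, signals broken trust-equivalence even when their accuracies match.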
Related papers
- SURE: Semi-dense Uncertainty-REfined Feature Matching [28.68008638977835]
SURE is a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence.
Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module.
Our method consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency.
arXiv Detail & Related papers (2026-03-05T06:53:11Z)
- Downsized and Compromised?: Assessing the Faithfulness of Model Compression [0.0]
This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics.
We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression.
Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups.
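A minimal sketch of the kind of checks this summary describes, assuming hard-label predictions from both models are available; the function names are illustrative, and scipy's chi2_contingency supplies the significance test. This is a hedged reading of the summary, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def agreement_rate(preds_orig, preds_comp):
    """Fraction of inputs on which both models predict the same label."""
    return float(np.mean(np.asarray(preds_orig) == np.asarray(preds_comp)))

def prediction_shift_test(preds_orig, preds_comp, n_classes):
    """Chi-squared test for a significant shift in predicted-label counts
    (can be rerun per demographic subgroup to mirror the subgroup analysis)."""
    table = np.array([np.bincount(preds_orig, minlength=n_classes),
                      np.bincount(preds_comp, minlength=n_classes)])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value
```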
arXiv Detail & Related papers (2025-10-07T17:05:02Z)
- Quantifying the Reliability of Predictions in Detection Transformers: Object-Level Calibration and Image-Level Uncertainty [6.209833978040362]
In practice, DETR generates hundreds of predictions that far outnumber the actual objects present in an image.
This raises the question: can we trust and use all of these predictions?
We present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels.
arXiv Detail & Related papers (2024-12-02T18:34:17Z)
- Evaluating Model Bias Requires Characterizing its Mistakes [19.777130236160712]
We introduce SkewSize, a principled and flexible metric that captures bias from mistakes in a model's predictions.
It can be used in multi-class settings or generalised to the open vocabulary setting of generative models.
We demonstrate the utility of SkewSize in multiple settings including: standard vision models trained on synthetic data, vision models trained on ImageNet, and large scale vision-and-language models from the BLIP-2 family.
arXiv Detail & Related papers (2024-07-15T11:46:21Z)
- Accuracy is Not All You Need [9.371810162601623]
We conduct a detailed study of metrics across multiple compression techniques, models and datasets.
We show that the behavior of compressed models as visible to end-users differs significantly from that of the baseline model, even when accuracy is similar.
We propose two metrics capturing this difference, KL-Divergence and flips, and show that they are well correlated.
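A minimal sketch of how these two metrics might look, assuming softmax outputs from both the baseline and compressed models; the paper's exact definitions may differ.

```python
import numpy as np

def flips(probs_base, probs_comp):
    """Fraction of examples whose predicted label changes after compression."""
    return float(np.mean(probs_base.argmax(axis=1) != probs_comp.argmax(axis=1)))

def mean_kl(probs_base, probs_comp, eps=1e-12):
    """Mean per-example KL(baseline || compressed) between output distributions."""
    p = np.clip(probs_base, eps, 1.0)
    q = np.clip(probs_comp, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```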
arXiv Detail & Related papers (2024-07-12T10:19:02Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
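As an assumed illustration of the simplest of these consistency measures (agreement among samples), the sketch below derives a confidence score from repeated generations for the same prompt; the paper studies three measures, and its exact formulations may differ.

```python
from collections import Counter

def consistency_confidence(sampled_answers):
    """Majority answer among sampled generations, plus the agreement rate
    used as the model's confidence estimate."""
    counts = Counter(sampled_answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(sampled_answers)

# e.g. five samples for one prompt -> answer "42" with confidence 0.8
print(consistency_confidence(["42", "42", "41", "42", "42"]))
```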
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that simultaneously improves both OOD accuracy and confidence calibration in vision-language models.
We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under our refined robustness metric, a model is judged robust only if its performance is consistently accurate across the whole of each clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation [51.809741427975105]
This work aims to assess how well a model performs under distribution shifts without using labels.
We use the nuclear norm of the prediction matrix, which has been shown to be effective in characterizing both properties.
We show that the nuclear norm-based estimate is more accurate and robust than existing methods.
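A sketch of the idea, under the assumption that the prediction matrix is the N x C stack of softmax outputs on an unlabeled test set: a larger nuclear norm reflects both higher confidence and higher dispersity, which the summary links to higher accuracy under shift.

```python
import numpy as np

def prediction_matrix_nuclear_norm(probs):
    """Nuclear norm (sum of singular values) of the N x C softmax matrix;
    compare values across test sets of equal size N."""
    return float(np.linalg.norm(probs, ord='nuc'))
```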
arXiv Detail & Related papers (2023-02-02T13:30:48Z)
- Usable Region Estimate for Assessing Practical Usability of Medical Image Segmentation Models [32.56957759180135]
We aim to quantitatively measure the practical usability of medical image segmentation models.
We first propose a measure, Correctness-Confidence Rank Correlation (CCRC), to capture how well predictions' confidence estimates rank-correlate with their correctness scores.
We then propose Usable Region Estimate (URE), which simultaneously quantifies predictions' correctness and reliability of confidence assessments in one estimate.
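A hedged sketch of a CCRC-style check: how well do per-prediction confidence estimates rank-correlate with per-prediction correctness scores (e.g., per-case Dice)? Spearman's rho is one natural instantiation; the paper's exact CCRC formulation may differ.

```python
from scipy.stats import spearmanr

def rank_correlation(confidences, correctness):
    """Spearman rank correlation between confidence estimates and
    correctness scores (e.g., per-case Dice) over a set of predictions."""
    rho, _ = spearmanr(confidences, correctness)
    return rho
```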
arXiv Detail & Related papers (2022-07-01T02:33:44Z)
- Learning Accurate Dense Correspondences and When to Trust Them [161.76275845530964]
We aim to estimate a dense flow field relating two images, coupled with a robust pixel-wise confidence map.
We develop a flexible probabilistic approach that jointly learns the flow prediction and its uncertainty.
Our approach obtains state-of-the-art results on challenging geometric matching and optical flow datasets.
arXiv Detail & Related papers (2021-01-05T18:54:11Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)