CAME: Contrastive Automated Model Evaluation
- URL: http://arxiv.org/abs/2308.11111v1
- Date: Tue, 22 Aug 2023 01:24:14 GMT
- Title: CAME: Contrastive Automated Model Evaluation
- Authors: Ru Peng, Qiuyang Duan, Haobo Wang, Jiachen Ma, Yanbo Jiang, Yongjun
Tu, Xiu Jiang, Junbo Zhao
- Abstract summary: Contrastive Automated Model Evaluation (CAME) is a novel AutoEval framework that removes the training set from the loop.
CAME establishes new state-of-the-art (SOTA) results for AutoEval, surpassing prior work significantly.
- Score: 12.879345202312628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Automated Model Evaluation (AutoEval) framework entertains the
possibility of evaluating a trained machine learning model without resorting to
a labeled testing set. Despite the promise and some decent results, the
existing AutoEval methods rely heavily on computing distribution shifts between
the unlabeled testing set and the training set. We believe this reliance on the
training set is another obstacle to shipping this technology to real-world ML
development. In this work, we propose Contrastive Automated Model Evaluation
(CAME), a novel AutoEval framework that removes the training set from the loop.
The core idea of CAME rests on a theoretical analysis that links model
performance to a contrastive loss. Further, through extensive empirical
validation, we establish a predictable relationship between the two, obtained
simply by running inference on the unlabeled/unseen testing set. The resulting
framework, CAME, establishes new SOTA results for AutoEval, surpassing prior
work significantly.
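The core recipe described in the abstract, predicting test accuracy from a contrastive loss measured on the unlabeled test set, can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the calibration data below is synthetic and the linear functional form is an assumption.

```python
import numpy as np

# Illustrative sketch of CAME's core idea: a model's accuracy on an unseen
# test set is predicted from a contrastive loss computed on that (unlabeled)
# set. The calibration pairs below are synthetic; the paper instead derives
# the relationship theoretically and validates it empirically.

rng = np.random.default_rng(0)

# Synthetic calibration: contrastive losses and accuracies on known shifted sets.
losses = rng.uniform(1.0, 5.0, size=20)
accuracies = 0.95 - 0.12 * losses + rng.normal(0, 0.01, size=20)

# Fit a linear relationship: accuracy ~ a * loss + b (least squares).
a, b = np.polyfit(losses, accuracies, deg=1)

def predict_accuracy(contrastive_loss: float) -> float:
    """Estimate accuracy from the contrastive loss on an unlabeled test set."""
    return a * contrastive_loss + b

print(round(predict_accuracy(3.0), 3))
```

Once the relationship is calibrated, no labels are needed at evaluation time: computing the contrastive loss on the incoming unlabeled test set is enough to read off a predicted accuracy.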
Related papers
- FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks [62.897993591443594]
FullCert is the first end-to-end certifier with sound, deterministic bounds.
We experimentally demonstrate FullCert's feasibility on two datasets.
arXiv Detail & Related papers (2024-06-17T13:23:52Z)
- FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification [10.911464455072391]
FACTUAL is a Contrastive Learning framework for Adversarial Training and robust SAR classification.
Our model achieves 99.7% accuracy on clean samples, and 89.6% on perturbed samples, both outperforming previous state-of-the-art methods.
arXiv Detail & Related papers (2024-04-04T06:20:22Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Rubric-Specific Approach to Automated Essay Scoring with Augmentation Training [0.1227734309612871]
We propose a series of data augmentation operations that train and test an automated scoring model to learn features and functions overlooked by previous works.
We achieve state-of-the-art performance on the Automated Student Assessment Prize dataset.
arXiv Detail & Related papers (2023-09-06T05:51:19Z)
- Test-Time Adaptation with Perturbation Consistency Learning [32.58879780726279]
We propose a simple test-time adaptation method that encourages the model to make stable predictions for samples under distribution shift.
Our method can achieve higher or comparable performance with less inference time over strong PLM backbones.
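A perturbation-consistency objective of the kind this blurb describes can be illustrated with a minimal sketch. The KL-based formulation and all inputs below are assumptions chosen for illustration, not the paper's exact method.

```python
import numpy as np

# Illustrative sketch (not the paper's exact method): a perturbation-consistency
# objective penalizes divergence between the model's predictions on an input
# and on a perturbed copy of it, encouraging stable predictions under shift.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_clean, logits_perturbed):
    """KL(p_clean || p_perturbed), averaged over the batch."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

logits = np.array([[2.0, 0.5, -1.0]])
print(round(consistency_loss(logits, logits), 6))  # identical inputs give 0.0
```

At test time, minimizing such a loss over unlabeled inputs adapts the model without needing any labels.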
arXiv Detail & Related papers (2023-04-25T12:29:22Z)
- Effective Robustness against Natural Distribution Shifts for Models with Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new metric to evaluate and compare the effective robustness of models trained on different data.
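The notion of "effective robustness" — OOD robustness beyond what ID performance predicts — can be sketched as a residual above a fitted ID→OOD trend. The linear trend and all numbers below are hypothetical, not the paper's metric or data.

```python
import numpy as np

# Hypothetical illustration of "effective robustness": fit the typical
# ID-accuracy -> OOD-accuracy trend across baseline models, then measure how
# far a candidate model's OOD accuracy sits above that trend.

id_acc = np.array([0.70, 0.75, 0.80, 0.85, 0.90])   # baseline ID accuracies
ood_acc = np.array([0.50, 0.56, 0.62, 0.68, 0.74])  # baseline OOD accuracies

# Least-squares linear trend: ood ~ slope * id + intercept.
slope, intercept = np.polyfit(id_acc, ood_acc, deg=1)

def effective_robustness(model_id_acc, model_ood_acc):
    """OOD accuracy beyond what the ID->OOD trend predicts."""
    return model_ood_acc - (slope * model_id_acc + intercept)

print(round(effective_robustness(0.80, 0.70), 3))  # prints 0.08
```

A model exactly on the trend scores zero; a positive value indicates out-of-distribution robustness that cannot be explained by in-distribution accuracy alone.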
arXiv Detail & Related papers (2023-02-02T19:28:41Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Real-world unlabeled data is commonly imbalanced, following a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free [115.81899803240758]
Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy.
This paper asks how to quickly calibrate a trained model in-situ, to examine the achievable trade-offs between its standard and robust accuracies.
Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework.
arXiv Detail & Related papers (2020-10-22T16:06:34Z)
- SE3M: A Model for Software Effort Estimation Using Pre-trained Embedding Models [0.8287206589886881]
This paper proposes to evaluate the effectiveness of pre-trained embedding models.
Generic pre-trained models for both approaches went through a fine-tuning process.
Results were very promising, showing that pre-trained models can be used to estimate software effort from requirements texts alone.
arXiv Detail & Related papers (2020-06-30T14:15:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.