Firenze: Model Evaluation Using Weak Signals
- URL: http://arxiv.org/abs/2207.00827v1
- Date: Sat, 2 Jul 2022 13:20:38 GMT
- Title: Firenze: Model Evaluation Using Weak Signals
- Authors: Bhavna Soman, Ali Torkamani, Michael J. Morais, Jeffrey Bickford,
Baris Coskun
- Abstract summary: We introduce Firenze, a novel framework for comparative evaluation of machine learning models' performance.
We show that markers computed and combined over select subsets of samples, called regions of interest, can provide a robust estimate of the models' real-world performance.
- Score: 5.723905680436377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data labels in the security field are frequently noisy, limited, or biased
towards a subset of the population. As a result, commonplace evaluation methods
such as accuracy, precision and recall metrics, or analysis of performance
curves computed from labeled datasets do not provide sufficient confidence in
the real-world performance of a machine learning (ML) model. This has slowed
the adoption of machine learning in the field. In the industry today, we rely
on domain expertise and lengthy manual evaluation to build this confidence
before shipping a new model for security applications. In this paper, we
introduce Firenze, a novel framework for comparative evaluation of ML models'
performance using domain expertise, encoded into scalable functions called
markers. We show that markers computed and combined over select subsets of
samples called regions of interest can provide a robust estimate of their
real-world performances. Critically, we use statistical hypothesis testing to
ensure that observed differences, and therefore the conclusions emerging from our
framework, are more prominent than those observable from noise alone. Using
simulations and two real-world datasets for malware and domain-name-service
reputation detection, we illustrate our approach's effectiveness, limitations,
and insights. Taken together, we propose Firenze as a resource for fast,
interpretable, and collaborative model development and evaluation by mixed
teams of researchers, domain experts, and business owners.
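As a rough illustration of the comparison the abstract describes, the sketch below scores two models with a single marker over one region of interest and applies a paired hypothesis test to check that the observed difference stands out from noise. The marker (agreement with a blocklist), the region selection, and the choice of a Wilcoxon signed-rank test are assumptions made for illustration only, not the paper's exact markers or procedure.

```python
# Minimal sketch of a Firenze-style comparison (illustrative assumptions only):
# a "marker" encodes domain expertise as a per-sample score, a "region of
# interest" selects the samples it applies to, and a paired hypothesis test
# checks that the observed difference between models exceeds the noise.
import numpy as np
from scipy.stats import wilcoxon

def marker_agrees_with_blocklist(scores, on_blocklist):
    """Hypothetical marker: reward high model scores on known-bad samples."""
    return np.where(on_blocklist, scores, 1.0 - scores)

def compare_models(scores_a, scores_b, on_blocklist, region_mask, alpha=0.05):
    """Compare two models' marker values over one region of interest."""
    m_a = marker_agrees_with_blocklist(scores_a, on_blocklist)[region_mask]
    m_b = marker_agrees_with_blocklist(scores_b, on_blocklist)[region_mask]
    stat, p_value = wilcoxon(m_a, m_b)  # paired, non-parametric test
    better = "A" if m_a.mean() > m_b.mean() else "B"
    return {"winner": better if p_value < alpha else "no significant difference",
            "p_value": p_value}

# Example usage with synthetic scores for samples in a hypothetical high-risk region.
rng = np.random.default_rng(0)
n = 500
blocklist = rng.random(n) < 0.3
region = rng.random(n) < 0.5  # e.g., newly observed domains
model_a = np.clip(blocklist * 0.7 + rng.normal(0.2, 0.2, n), 0, 1)
model_b = np.clip(blocklist * 0.5 + rng.normal(0.2, 0.2, n), 0, 1)
print(compare_models(model_a, model_b, blocklist, region))
```

In the full framework, multiple markers would be computed and combined across several regions of interest; the single-marker, single-region comparison above only shows the shape of the computation.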
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - Determining Domain of Machine Learning Models using Kernel Density Estimates: Applications in Materials Property Prediction [1.8551396341435895]
We develop a new approach of assessing model domain using kernel density estimation.
We show that chemical groups considered unrelated based on established chemical knowledge exhibit significant dissimilarities by our measure.
High measures of dissimilarity are associated with poor model performance and poor estimates of model uncertainty.
arXiv Detail & Related papers (2024-05-28T15:41:16Z) - Test-time Assessment of a Model's Performance on Unseen Domains via Optimal Transport [8.425690424016986]
Gauging the performance of ML models on data from unseen domains is essential, which calls for metrics that can provide insight into a model's performance at test time.
We propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains.
arXiv Detail & Related papers (2024-05-02T16:35:07Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Exploring validation metrics for offline model-based optimisation with
diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z) - DATa: Domain Adaptation-Aided Deep Table Detection Using Visual-Lexical
Representations [2.542864854772221]
We present a novel Domain Adaptation-aided deep Table detection method called DATa.
It guarantees satisfactory performance in a specific target domain where few trusted labels are available.
Experiments show that DATa substantially outperforms competing methods that only utilize visual representations in the target domain.
arXiv Detail & Related papers (2022-11-12T12:14:16Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target-domain accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Who Explains the Explanation? Quantitatively Assessing Feature
Attribution Methods [0.0]
We propose a novel evaluation metric -- the Focus -- designed to quantify the faithfulness of explanations.
We show the robustness of the metric through randomization experiments, and then use Focus to evaluate and compare three popular explainability techniques.
Our results find LRP and GradCAM to be consistent and reliable, while the latter remains most competitive even when applied to poorly performing models.
arXiv Detail & Related papers (2021-09-28T07:10:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.