A Computational Theory for Efficient Model Evaluation with Causal Guarantees
- URL: http://arxiv.org/abs/2503.21138v3
- Date: Sat, 19 Apr 2025 04:06:47 GMT
- Title: A Computational Theory for Efficient Model Evaluation with Causal Guarantees
- Authors: Hedong Yan
- Abstract summary: We prove upper bounds on the generalized error and the generalized causal effect error of given evaluation models. We also prove efficiency and consistency of the prediction-based estimate of the causal effect of a deployed subject on the evaluation metric.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To reduce the cost of experimental evaluation of models, we introduce a computational theory of evaluation for prediction and decision models: building an evaluation model to accelerate evaluation procedures. We prove upper bounds on the generalized error and the generalized causal effect error of given evaluation models. We also prove efficiency and consistency of the prediction-based estimate of the causal effect of a deployed subject on the evaluation metric. To learn evaluation models, we propose a meta-learner that handles heterogeneous evaluation-subject spaces. Compared with existing evaluation approaches, our (conditional) evaluation model reduces evaluation errors by 24.1%-99.0% across 12 scenarios, including individualized medicine, scientific simulation, social experiments, business activity, and quantitative trading. Evaluation time is reduced by 3-7 orders of magnitude compared with experiments or simulations.
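As a rough, hedged sketch of the core idea, the snippet below trains a surrogate "evaluation model" that predicts an expensive evaluation metric directly from features of the subject under test, so new subjects can be scored by prediction rather than by experiment. The feature layout, the random-forest choice, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-in for an expensive experiment: the "true" evaluation metric of a
# subject (e.g., a deployed model) as an unknown function of its features.
def run_expensive_experiment(subject_features):
    return subject_features @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.1)

# A small pool of subjects for which real experiments are affordable.
X_train = rng.normal(size=(200, 3))
y_train = np.array([run_expensive_experiment(x) for x in X_train])

# The evaluation model: a cheap surrogate that predicts the metric directly.
eval_model = RandomForestRegressor(n_estimators=200, random_state=0)
eval_model.fit(X_train, y_train)

# New subjects are now scored by prediction instead of by experiment.
X_new = rng.normal(size=(5, 3))
print(eval_model.predict(X_new))
```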
Related papers
- Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models [51.067146460271466]
Evaluation of visual generative models can be time-consuming and computationally expensive.
We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations.
It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
arXiv Detail & Related papers (2024-12-10T18:52:39Z) - Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations [0.6526824510982799]
The literature on evaluations has largely ignored work from other sciences on experiment analysis and planning.
This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations.
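As a minimal illustration of the kind of analysis the article advocates (my own sketch, not the article's code), one can attach a standard error and a 95% confidence interval to an eval score computed over independent questions:

```python
import numpy as np

# Per-question correctness (1 = correct) from an eval run; illustrative data.
scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1], dtype=float)

mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem    # normal-approximation 95% CI
print(f"accuracy = {mean:.3f} +/- {1.96 * sem:.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```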
arXiv Detail & Related papers (2024-11-01T14:57:16Z) - Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that the capability of a model can be equivalently assessed by its consistency with a certain reference model.
Because the conditions behind this result may not fully hold in practice, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
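A toy sketch of the consistency idea referenced above (not the PoEM algorithm itself; names and data are illustrative): score a candidate model by its agreement rate with a reference model on unlabeled inputs.

```python
import numpy as np

def consistency(candidate_preds, reference_preds):
    """Agreement rate between two models' predictions on the same unlabeled inputs."""
    return float(np.mean(np.asarray(candidate_preds) == np.asarray(reference_preds)))

# Illustrative predictions on eight unlabeled examples.
reference = [1, 0, 1, 1, 0, 0, 1, 1]
candidate = [1, 0, 1, 0, 0, 0, 1, 1]
print(consistency(candidate, reference))  # 0.875
```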
arXiv Detail & Related papers (2024-08-25T06:49:03Z) - Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent different metrics are at evaluating models across data of different prevalence.
I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models.
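One quick way to see the prevalence effect measured in this study (illustrative code under assumed Gaussian score distributions, not the paper's experiment): hold the scoring rule fixed and compare AUC with fixed-threshold accuracy on test sets of different positive rates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

def make_testset(n, prevalence):
    # Synthetic scores: positives score higher on average than negatives.
    y = (rng.random(n) < prevalence).astype(int)
    scores = rng.normal(loc=1.5 * y, scale=1.0)
    return y, scores

for prev in (0.05, 0.5):
    y, s = make_testset(20_000, prev)
    print(f"prevalence={prev:.2f}  AUC={roc_auc_score(y, s):.3f}  "
          f"acc@0={accuracy_score(y, (s > 0).astype(int)):.3f}")
```

Under this setup the AUC stays roughly constant across prevalence, while the fixed-threshold accuracy shifts substantially.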
arXiv Detail & Related papers (2024-08-19T17:52:38Z) - Towards Reliable Empirical Machine Unlearning Evaluation: A Cryptographic Game Perspective [5.724350004671127]
Machine unlearning updates machine learning models to remove information from specific training samples, complying with data protection regulations. Despite the recent development of numerous unlearning algorithms, reliable evaluation of these algorithms remains an open research question. This work presents a novel and reliable approach to empirically evaluating unlearning algorithms, paving the way for the development of more effective unlearning techniques.
arXiv Detail & Related papers (2024-04-17T17:20:27Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - High Precision Causal Model Evaluation with Conditional Randomization [10.23470075454725]
We introduce a novel low-variance estimator for causal error, dubbed the pairs estimator.
By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller variance.
Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself.
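The snippet below is only a rough sketch of the intuition just described (apply the same IPW weighting to model-predicted and to experimentally observed outcomes so that shared noise cancels in their difference), not the authors' exact pairs estimator; the function names and the synthetic usage are assumptions.

```python
import numpy as np

def ipw_ate(t, y, e):
    """IPW estimate of the average treatment effect with known propensities e = P(T=1|X)."""
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

def paired_causal_error(t, y_obs, y_model, e):
    """Difference of IPW-ATEs computed with identical weights on model-predicted
    and observed outcomes, so much of the IPW-induced variance cancels."""
    return ipw_ate(t, y_model, e) - ipw_ate(t, y_obs, e)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
e = 1.0 / (1.0 + np.exp(-x))                 # conditional-randomization propensities
t = rng.binomial(1, e)
y_obs = 2.0 * t + x + rng.normal(size=2000)  # true effect = 2
y_model = 1.5 * t + x                        # a causal model that under-estimates the effect
print(paired_causal_error(t, y_obs, y_model, e))  # about -0.5, up to estimation noise
```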
arXiv Detail & Related papers (2023-11-03T13:22:27Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
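A minimal sketch of an adaptive-testing step in that spirit (my illustration, not the paper's system): under a two-parameter-logistic IRT model, choose the not-yet-asked item with the greatest Fisher information at the current ability estimate.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response model: probability a subject of ability theta answers correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of each 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta_hat, a, b, asked):
    """Index of the most informative item not yet administered."""
    info = item_information(theta_hat, a, b)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Illustrative item bank: discriminations a and difficulties b.
a = np.array([1.0, 1.5, 0.8, 2.0])
b = np.array([-1.0, 0.0, 1.0, 0.5])
print(next_item(theta_hat=0.2, a=a, b=b, asked={1}))  # selects item 3 here
```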
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Hyperparameter Tuning and Model Evaluation in Causal Effect Estimation [2.7823528791601686]
This paper investigates the interplay between the four different aspects of model evaluation for causal effect estimation.
We find that most causal estimators are roughly equivalent in performance if tuned thoroughly enough.
We call for more research into causal model evaluation to unlock the optimum performance not currently being delivered even by state-of-the-art procedures.
arXiv Detail & Related papers (2023-03-02T17:03:02Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - An evaluation framework for comparing causal inference models [3.1372269816123994]
We use the proposed evaluation methodology to compare several state-of-the-art causal effect estimation models.
The main motivation behind this approach is to eliminate the influence of a small number of instances or simulations on the benchmarking process.
arXiv Detail & Related papers (2022-08-31T21:04:20Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Causal Inference of General Treatment Effects using Neural Networks with A Diverging Number of Confounders [12.105996764226227]
Under the unconfoundedness condition, adjustment for confounders requires estimating the nuisance functions relating outcome or treatment to confounders nonparametrically.
This paper considers a generalized optimization framework for efficient estimation of general treatment effects using artificial neural networks (ANNs) to approximate the unknown nuisance function of growing-dimensional confounders.
arXiv Detail & Related papers (2020-09-15T13:07:24Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects [61.03579766573421]
We study estimation of individual-level causal effects, such as a single patient's response to alternative medication.
We devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance.
We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances.
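A compact sketch of the regularization idea (not the paper's algorithm or its bound): penalize a distance between the learned representations of treated and control units, here a simple RBF-kernel MMD as one of several possible distances, alongside the usual outcome-prediction loss.

```python
import numpy as np

def rbf_mmd2(phi_treated, phi_control, sigma=1.0):
    """Squared RBF-kernel MMD between treated and control representations."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return (k(phi_treated, phi_treated).mean()
            + k(phi_control, phi_control).mean()
            - 2.0 * k(phi_treated, phi_control).mean())

# In training, the total objective would look roughly like
#   loss = outcome_prediction_loss + alpha * rbf_mmd2(phi[t == 1], phi[t == 0])
# where phi is the encoder output and alpha trades off fit against balance (illustrative).

rng = np.random.default_rng(0)
print(rbf_mmd2(rng.normal(size=(50, 4)), rng.normal(0.5, 1.0, size=(50, 4))))
```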
arXiv Detail & Related papers (2020-01-21T10:16:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.