Evaluation metrics for behaviour modeling
- URL: http://arxiv.org/abs/2007.12298v1
- Date: Thu, 23 Jul 2020 23:47:24 GMT
- Title: Evaluation metrics for behaviour modeling
- Authors: Daniel Jiwoong Im, Iljung Kwak, Kristin Branson
- Abstract summary: We propose and investigate metrics for evaluating and comparing generative models of behavior learned using imitation learning.
These criteria look at longer temporal relationships in behavior, are relevant if behavior has some properties that are inherently unpredictable, and highlight biases in the overall distribution of behaviors produced by the model.
We show that the proposed metrics correspond with biologists' intuition about behavior, and allow us to evaluate models, understand their biases, and propose new research directions.
- Score: 2.616915680939834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A primary difficulty with unsupervised discovery of structure in large data
sets is a lack of quantitative evaluation criteria. In this work, we propose
and investigate several metrics for evaluating and comparing generative models
of behavior learned using imitation learning. Compared to the commonly-used
model log-likelihood, these criteria look at longer temporal relationships in
behavior, are relevant if behavior has some properties that are inherently
unpredictable, and highlight biases in the overall distribution of behaviors
produced by the model. Pointwise metrics compare real to model-predicted
trajectories given true past information. Distribution metrics compare
statistics of the model simulating behavior in open loop, and are inspired by
how experimental biologists evaluate the effects of manipulations on animal
behavior. We show that the proposed metrics correspond with biologists'
intuitions about behavior, and allow us to evaluate models, understand their
biases, and propose new research directions.
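To make the two metric families concrete, here is a minimal, hypothetical Python sketch (not the authors' code): `pointwise_error` scores one-step predictions given the true past, and `distribution_distance` compares a behavioral statistic between real and open-loop simulated trajectories. The function names, the L2 error, and the histogram-based Jensen-Shannon divergence are illustrative assumptions, not the exact metrics defined in the paper.

```python
# Hypothetical sketch of the two metric families described in the abstract.
# Trajectories are assumed to be NumPy arrays of shape (T, D):
# T frames of D behavioral features.
import numpy as np

def pointwise_error(real_traj, predict_next, history=5):
    """Pointwise metric: feed the model the *true* past and compare its
    one-step prediction to the real next frame (mean L2 error here)."""
    errors = []
    for t in range(history, len(real_traj)):
        pred = predict_next(real_traj[t - history:t])  # model sees true history
        errors.append(np.linalg.norm(pred - real_traj[t]))
    return float(np.mean(errors))

def distribution_distance(real_trajs, simulated_trajs, statistic, bins=50):
    """Distribution metric: run the model in open loop to produce simulated
    trajectories, then compare the distribution of a behavioral statistic
    (e.g. speed or turn rate) between real and simulated data. A histogram
    based Jensen-Shannon divergence is used purely for illustration."""
    real_vals = np.concatenate([statistic(tr) for tr in real_trajs])
    sim_vals = np.concatenate([statistic(tr) for tr in simulated_trajs])
    lo = min(real_vals.min(), sim_vals.min())
    hi = max(real_vals.max(), sim_vals.max())
    p, _ = np.histogram(real_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sim_vals, bins=bins, range=(lo, hi))
    p = (p + 1e-12) / (p + 1e-12).sum()  # smooth and normalize counts
    q = (q + 1e-12) / (q + 1e-12).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example usage with a trivial "persistence" model that predicts the last frame.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = [rng.standard_normal((200, 2)).cumsum(axis=0) for _ in range(10)]
    sim = [rng.standard_normal((200, 2)).cumsum(axis=0) for _ in range(10)]
    speed = lambda tr: np.linalg.norm(np.diff(tr, axis=0), axis=1)  # per-frame speed
    print(pointwise_error(real[0], predict_next=lambda past: past[-1]))
    print(distribution_distance(real, sim, statistic=speed))
```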
Related papers
- Analyzing Generative Models by Manifold Entropic Metrics [8.477943884416023]
We introduce a novel set of tractable information-theoretic evaluation metrics.
We compare various normalizing flow architectures and $\beta$-VAEs on the EMNIST dataset.
The most interesting finding of our experiments is a ranking of model architectures and training procedures in terms of their inductive bias to converge to aligned and disentangled representations during training.
arXiv Detail & Related papers (2024-10-25T09:35:00Z)
- Estimating Causal Effects from Learned Causal Networks [56.14597641617531]
We propose an alternative paradigm for answering causal-effect queries over discrete observable variables.
We learn the causal Bayesian network and its confounding latent variables directly from the observational data.
We show that this *model completion* learning approach can be more effective than estimand approaches.
arXiv Detail & Related papers (2024-08-26T08:39:09Z)
- Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent different metrics are at evaluating models across data of different prevalence.
I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models.
arXiv Detail & Related papers (2024-08-19T17:52:38Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Faithful Model Evaluation for Model-Based Metrics [22.753929098534403]
We establish the mathematical foundation of significance testing for model-based metrics.
We show that accounting for metric-model errors when calculating sample variances for model-based metrics changes the conclusions in certain experiments.
arXiv Detail & Related papers (2023-12-19T19:41:33Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Comparing merging behaviors observed in naturalistic data with behaviors generated by a machine learned model [4.879725885276143]
We study highway driving as an example scenario, and introduce metrics to quantitatively demonstrate the presence of two familiar behavioral phenomena.
Applying the exact same metrics to the output of a state-of-the-art machine-learned model, we show that the model is capable of reproducing the former phenomenon, but not the latter.
arXiv Detail & Related papers (2021-04-21T12:31:29Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
- Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.