How Aligned are Different Alignment Metrics?
- URL: http://arxiv.org/abs/2407.07530v1
- Date: Wed, 10 Jul 2024 10:36:11 GMT
- Title: How Aligned are Different Alignment Metrics?
- Authors: Jannis Ahlert, Thomas Klein, Felix Wichmann, Robert Geirhos
- Abstract summary: We analyze visual data from Brain-Score, together with human feature alignment and human similarity judgement metrics.
We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative.
Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics.
- Score: 6.172390472790253
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, various methods and benchmarks have been proposed to empirically evaluate the alignment of artificial neural networks to human neural and behavioral data. But how aligned are different alignment metrics? To answer this question, we analyze visual data from Brain-Score (Schrimpf et al., 2018), including metrics from the model-vs-human toolbox (Geirhos et al., 2021), together with human feature alignment (Linsley et al., 2018; Fel et al., 2022) and human similarity judgements (Muttenthaler et al., 2022). We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative. For instance, the average correlation between those 80 models on Brain-Score that were fully evaluated on all 69 alignment metrics we considered is only 0.198. Assuming that all of the employed metrics are sound, this implies that alignment with human perception may best be thought of as a multidimensional concept, with different methods measuring fundamentally different aspects. Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics. Aggregating by taking the arithmetic average, as done in Brain-Score, leads to the overall performance currently being dominated by behavior (95.25% explained variance) while the neural predictivity plays a less important role (only 33.33% explained variance). As a first step towards making sure that different alignment metrics all contribute fairly towards an integrative benchmark score, we therefore conclude by comparing three different aggregation options.
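As a rough illustration of the analysis and the aggregation question, here is a minimal sketch on synthetic scores. The array shapes mirror the 80 models and 69 metrics mentioned above, but the values, the use of Pearson correlation, and the three aggregation rules are stand-in assumptions, not Brain-Score's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a benchmark table: rows = models, columns = metrics.
n_models, n_metrics = 80, 69
scores = rng.random((n_models, n_metrics))

# Pairwise correlations between metrics, computed across models.
corr = np.corrcoef(scores, rowvar=False)            # (n_metrics, n_metrics)
off_diag = corr[~np.eye(n_metrics, dtype=bool)]
print(f"mean pairwise metric correlation: {off_diag.mean():.3f}")

# Three aggregation options for a single benchmark score per model:
# 1) arithmetic mean of raw scores (as in Brain-Score),
# 2) mean of z-scored metrics, so each metric contributes equal variance,
# 3) mean of per-metric ranks, which ignores scale entirely.
mean_agg = scores.mean(axis=1)
z_agg = ((scores - scores.mean(axis=0)) / scores.std(axis=0)).mean(axis=1)
rank_agg = scores.argsort(axis=0).argsort(axis=0).mean(axis=1)
```

Z-scoring or rank averaging prevents a metric family with larger variance from dominating the aggregate, which is the failure mode the abstract quantifies for the arithmetic mean.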
Related papers
- An unsupervised learning approach to evaluate questionnaire data -- what one can learn from violations of measurement invariance [2.4762962548352467]
This paper promotes an unsupervised learning-based approach to analyzing such questionnaire data.
It works in three phases: data preparation, clustering of questionnaires, and measuring similarity based on the obtained clustering and the properties of each group.
It provides a natural comparison between groups and a natural description of the response patterns of the groups.
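A minimal sketch of the three-phase idea on synthetic Likert-scale data follows; KMeans and the group comparison are illustrative assumptions, not necessarily the paper's own choices:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Phase 1 (data preparation): 200 respondents x 20 Likert items (1-5).
responses = rng.integers(1, 6, size=(200, 20)).astype(float)
group = rng.integers(0, 2, size=200)  # hypothetical demographic groups

# Phase 2: cluster the questionnaires.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(responses)

# Phase 3: describe each cluster's response pattern and compare group composition.
for k in range(3):
    mask = labels == k
    print(f"cluster {k}: mean response {responses[mask].mean():.2f}, "
          f"share of group 1 = {group[mask].mean():.2f}")
```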
arXiv Detail & Related papers (2023-12-11T11:31:41Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
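The paper defines its fine-grained variants precisely; the sketch below only shows the shared ingredients, per-image per-class IoUs plus a worst-case summary that large objects cannot dominate, on random label maps:

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU per class from integer label maps; NaN where a class is absent."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

rng = np.random.default_rng(0)
pairs = [(rng.integers(0, 3, (64, 64)), rng.integers(0, 3, (64, 64)))
         for _ in range(10)]
per_image = np.array([np.nanmean(per_class_iou(p, t, 3)) for p, t in pairs])
print("mean of image-wise mIoU:", per_image.mean())   # fine-grained average
print("worst-case image mIoU:  ", per_image.min())    # worst-case summary
```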
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
- Can neural networks count digit frequency? [16.04455549316468]
We compare the performance of different classical machine learning models and neural networks in identifying the frequency of occurrence of each digit in a given number.
We observe that the neural networks significantly outperform the classical machine learning models in terms of both the regression and classification metrics, for both 6-digit and 10-digit numbers.
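A plausible construction of the task as 10-way counting regression; the digit-sequence encoding below is an assumption, as the paper may represent inputs differently:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_samples, n_digits):
    """Inputs are digit sequences; targets count each digit 0-9 (10-dim)."""
    digits = rng.integers(0, 10, size=(n_samples, n_digits))
    counts = np.stack([(digits == d).sum(axis=1) for d in range(10)], axis=1)
    return digits.astype(np.float32), counts.astype(np.float32)

X, y = make_dataset(10_000, 6)   # the 6-digit variant; 10-digit is analogous
print(X[0], "->", y[0])
```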
arXiv Detail & Related papers (2023-09-25T03:45:36Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
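The standard recipe behind such a study is to correlate each candidate metric with human ratings of the same samples; a synthetic-data sketch (Spearman is one common choice, not necessarily the paper's):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical scores for 50 generated motions: one automated metric and
# averaged human quality ratings for the same samples.
metric_scores = rng.random(50)
human_ratings = metric_scores + 0.3 * rng.normal(size=50)  # noisy agreement

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")  # higher rho = better-aligned metric
```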
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations.
We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
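One common way to operationalize such a disparity metric, assuming per-group accuracy as the performance measure (the paper's estimator may differ):

```python
import numpy as np

def disparity(correct, group):
    """Gap between the best- and worst-performing sub-population."""
    accs = [correct[group == g].mean() for g in np.unique(group)]
    return max(accs) - min(accs)

rng = np.random.default_rng(0)
correct = rng.random(1000) < 0.8       # per-sample correctness of some model
group = rng.integers(0, 3, size=1000)  # hypothetical sub-population labels
print(f"accuracy disparity across groups: {disparity(correct, group):.3f}")
```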
arXiv Detail & Related papers (2022-04-13T05:30:53Z)
- A First Step Towards Distribution Invariant Regression Metrics [1.370633147306388]
In classification, it has been stated repeatedly that performance metrics like the F-Measure and Accuracy are highly dependent on the class distribution.
We show that the same problem exists in regression: the distribution of odometry parameters in robotic applications, for example, can vary widely between recording sessions.
Here, we need regression algorithms that either perform equally well across all function values, or that focus on certain boundary regions such as high speeds.
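A small synthetic demonstration of this distribution dependence: with a skewed target distribution and target-dependent errors, a single global MSE conceals how strongly performance varies across the function-value range, while per-bin errors make it visible (values and bins are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed regression targets (few high-speed samples), error growing with the target.
y_true = rng.exponential(scale=10.0, size=5000)
y_pred = y_true + rng.normal(scale=0.1 * y_true)

print("global MSE:", np.mean((y_true - y_pred) ** 2))
bins = np.quantile(y_true, [0.0, 0.5, 0.9, 1.0])
for lo, hi in zip(bins[:-1], bins[1:]):
    m = (y_true >= lo) & (y_true <= hi)
    print(f"MSE in [{lo:6.1f}, {hi:6.1f}]: {np.mean((y_true[m] - y_pred[m]) ** 2):.2f}")
```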
arXiv Detail & Related papers (2020-09-10T23:40:46Z)
- Automatic sleep stage classification with deep residual networks in a mixed-cohort setting [63.52264764099532]
We developed a novel deep neural network model and assessed its generalizability across several large-scale cohorts.
Overall classification accuracy improved with increasing fractions of training data.
arXiv Detail & Related papers (2020-08-21T10:48:35Z)
- Batch Decorrelation for Active Metric Learning [21.99577268213412]
We present an active learning strategy for training parametric models of distance metrics, given triplet-based similarity assessments.
In contrast to prior work on class-based learning, we focus on metrics that express the degree of (dis)similarity between objects.
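A generic sketch of that setting: a parametric embedding induces the metric, and each triplet judgement yields a hinge loss (the linear embedding and margin are assumptions, and the batch-decorrelation strategy itself is not shown):

```python
import numpy as np

def triplet_loss(f, anchor, pos, neg, margin=1.0):
    """Push d(anchor, pos) below d(anchor, neg) by at least a margin."""
    d_pos = np.linalg.norm(f(anchor) - f(pos))
    d_neg = np.linalg.norm(f(anchor) - f(neg))
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5))        # a linear embedding stands in for the model
f = lambda x: W @ x                # the learned metric is distance in f-space
a, p, n = rng.normal(size=(3, 5))  # one triplet: anchor, similar, dissimilar
print("loss on one triplet:", triplet_loss(f, a, p, n))
```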
arXiv Detail & Related papers (2020-05-20T12:47:48Z)
- Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
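A minimal instance, assuming the simplest aggregate type (sums of regression targets over sets) and Gaussian noise, under which the maximum-likelihood fit reduces to least squares on aggregated features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5, 3))   # 200 bags of 5 instances with 3 features
w_true = np.array([1.0, -2.0, 0.5])
y_sum = (X @ w_true).sum(axis=1)   # only bag-level sums are observed

# Linearity gives sum_i <x_i, w> = <sum_i x_i, w>, so instance-level
# weights are recoverable from aggregated features alone.
Z = X.sum(axis=1)                  # (200, 3)
w_hat, *_ = np.linalg.lstsq(Z, y_sum, rcond=None)
print("recovered instance-level weights:", w_hat.round(3))
```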
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.