How Aligned are Different Alignment Metrics?
- URL: http://arxiv.org/abs/2407.07530v1
- Date: Wed, 10 Jul 2024 10:36:11 GMT
- Title: How Aligned are Different Alignment Metrics?
- Authors: Jannis Ahlert, Thomas Klein, Felix Wichmann, Robert Geirhos
- Abstract summary: We analyze visual data from Brain-Score, together with human feature alignment and human similarity judgement metrics.
We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative.
Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics.
- Score: 6.172390472790253
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, various methods and benchmarks have been proposed to empirically evaluate the alignment of artificial neural networks to human neural and behavioral data. But how aligned are different alignment metrics? To answer this question, we analyze visual data from Brain-Score (Schrimpf et al., 2018), including metrics from the model-vs-human toolbox (Geirhos et al., 2021), together with human feature alignment (Linsley et al., 2018; Fel et al., 2022) and human similarity judgements (Muttenthaler et al., 2022). We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative. For instance, the average correlation between those 80 models on Brain-Score that were fully evaluated on all 69 alignment metrics we considered is only 0.198. Assuming that all of the employed metrics are sound, this implies that alignment with human perception may best be thought of as a multidimensional concept, with different methods measuring fundamentally different aspects. Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics. Aggregating by taking the arithmetic average, as done in Brain-Score, leads to the overall performance currently being dominated by behavior (95.25% explained variance) while the neural predictivity plays a less important role (only 33.33% explained variance). As a first step towards making sure that different alignment metrics all contribute fairly towards an integrative benchmark score, we therefore conclude by comparing three different aggregation options.
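As a rough illustration of the analysis and the aggregation question, here is a minimal sketch on synthetic scores. The array shapes mirror the 80 models and 69 metrics mentioned above, but the values, the use of Pearson correlation, and the three aggregation rules are stand-in assumptions, not Brain-Score's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a benchmark table: rows = models, columns = metrics.
n_models, n_metrics = 80, 69
scores = rng.random((n_models, n_metrics))

# Pairwise correlations between metrics, computed across models.
corr = np.corrcoef(scores, rowvar=False)            # (n_metrics, n_metrics)
off_diag = corr[~np.eye(n_metrics, dtype=bool)]
print(f"mean pairwise metric correlation: {off_diag.mean():.3f}")

# Three aggregation options for a single benchmark score per model:
# 1) arithmetic mean of raw scores (as in Brain-Score),
# 2) mean of z-scored metrics, so each metric contributes equal variance,
# 3) mean of per-metric ranks, which ignores scale entirely.
mean_agg = scores.mean(axis=1)
z_agg = ((scores - scores.mean(axis=0)) / scores.std(axis=0)).mean(axis=1)
rank_agg = scores.argsort(axis=0).argsort(axis=0).mean(axis=1)
```

Z-scoring or rank averaging prevents a metric family with larger variance from dominating the aggregate, which is the failure mode the abstract quantifies for the arithmetic mean.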
Related papers
- An unsupervised learning approach to evaluate questionnaire data -- what one can learn from violations of measurement invariance [2.4762962548352467]
This paper promotes an unsupervised learning-based approach to analyzing such questionnaire data.
It works in three phases: data preparation, clustering of questionnaires, and measuring similarity based on the obtained clustering and the properties of each group.
It provides a natural comparison between groups and a natural description of the response patterns of the groups.
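A minimal sketch of the three-phase idea on synthetic Likert-scale data follows; KMeans and the group comparison are illustrative assumptions, not necessarily the paper's own choices:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Phase 1 (data preparation): 200 respondents x 20 Likert items (1-5).
responses = rng.integers(1, 6, size=(200, 20)).astype(float)
group = rng.integers(0, 2, size=200)  # hypothetical demographic groups

# Phase 2: cluster the questionnaires.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(responses)

# Phase 3: describe each cluster's response pattern and compare group composition.
for k in range(3):
    mask = labels == k
    print(f"cluster {k}: mean response {responses[mask].mean():.2f}, "
          f"share of group 1 = {group[mask].mean():.2f}")
```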
arXiv Detail & Related papers (2023-12-11T11:31:41Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
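The paper defines its fine-grained variants precisely; the sketch below only shows the shared ingredients, per-image per-class IoUs plus a worst-case summary that large objects cannot dominate, on random label maps:

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """IoU per class from integer label maps; NaN where a class is absent."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

rng = np.random.default_rng(0)
pairs = [(rng.integers(0, 3, (64, 64)), rng.integers(0, 3, (64, 64)))
         for _ in range(10)]
per_image = np.array([np.nanmean(per_class_iou(p, t, 3)) for p, t in pairs])
print("mean of image-wise mIoU:", per_image.mean())   # fine-grained average
print("worst-case image mIoU:  ", per_image.min())    # worst-case summary
```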
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
- Can neural networks count digit frequency? [16.04455549316468]
We compare the performance of different classical machine learning models and neural networks in identifying the frequency of occurrence of each digit in a given number.
We observe that the neural networks significantly outperform the classical machine learning models in terms of both the regression and classification metrics, for both 6-digit and 10-digit numbers.
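A plausible construction of the task as 10-way counting regression; the digit-sequence encoding below is an assumption, as the paper may represent inputs differently:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_samples, n_digits):
    """Inputs are digit sequences; targets count each digit 0-9 (10-dim)."""
    digits = rng.integers(0, 10, size=(n_samples, n_digits))
    counts = np.stack([(digits == d).sum(axis=1) for d in range(10)], axis=1)
    return digits.astype(np.float32), counts.astype(np.float32)

X, y = make_dataset(10_000, 6)   # the 6-digit variant; 10-digit is analogous
print(X[0], "->", y[0])
```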
arXiv Detail & Related papers (2023-09-25T03:45:36Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
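The standard recipe behind such a study is to correlate each candidate metric with human ratings of the same samples; a synthetic-data sketch (Spearman is one common choice, not necessarily the paper's):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical scores for 50 generated motions: one automated metric and
# averaged human quality ratings for the same samples.
metric_scores = rng.random(50)
human_ratings = metric_scores + 0.3 * rng.normal(size=50)  # noisy agreement

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")  # higher rho = better-aligned metric
```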
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations.
We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
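One common way to operationalize such a disparity metric, assuming per-group accuracy as the performance measure (the paper's estimator may differ):

```python
import numpy as np

def disparity(correct, group):
    """Gap between the best- and worst-performing sub-population."""
    accs = [correct[group == g].mean() for g in np.unique(group)]
    return max(accs) - min(accs)

rng = np.random.default_rng(0)
correct = rng.random(1000) < 0.8       # per-sample correctness of some model
group = rng.integers(0, 3, size=1000)  # hypothetical sub-population labels
print(f"accuracy disparity across groups: {disparity(correct, group):.3f}")
```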
arXiv Detail & Related papers (2022-04-13T05:30:53Z)
- A First Step Towards Distribution Invariant Regression Metrics [1.370633147306388]
In classification, it has been stated repeatedly that performance metrics like the F-Measure and Accuracy are highly dependent on the class distribution.
We show that the same problem exists in regression: the distribution of odometry parameters in robotic applications, for example, can vary widely between recording sessions.
Here, we need regression algorithms that either perform equally well across all function values, or that focus on certain boundary regions such as high speeds.
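A small synthetic demonstration of this distribution dependence: with a skewed target distribution and target-dependent errors, a single global MSE conceals how strongly performance varies across the function-value range, while per-bin errors make it visible (values and bins are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed regression targets (few high-speed samples), error growing with the target.
y_true = rng.exponential(scale=10.0, size=5000)
y_pred = y_true + rng.normal(scale=0.1 * y_true)

print("global MSE:", np.mean((y_true - y_pred) ** 2))
bins = np.quantile(y_true, [0.0, 0.5, 0.9, 1.0])
for lo, hi in zip(bins[:-1], bins[1:]):
    m = (y_true >= lo) & (y_true <= hi)
    print(f"MSE in [{lo:6.1f}, {hi:6.1f}]: {np.mean((y_true[m] - y_pred[m]) ** 2):.2f}")
```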
arXiv Detail & Related papers (2020-09-10T23:40:46Z)
- Automatic sleep stage classification with deep residual networks in a mixed-cohort setting [63.52264764099532]
We developed a novel deep neural network model and assessed its generalizability across several large-scale cohorts.
Overall classification accuracy improved with increasing fractions of training data.
arXiv Detail & Related papers (2020-08-21T10:48:35Z)
- Batch Decorrelation for Active Metric Learning [21.99577268213412]
We present an active learning strategy for training parametric models of distance metrics, given triplet-based similarity assessments.
In contrast to prior work on class-based learning, we focus on metrics that express the degree of (dis)similarity between objects.
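A generic sketch of that setting: a parametric embedding induces the metric, and each triplet judgement yields a hinge loss (the linear embedding and margin are assumptions, and the batch-decorrelation strategy itself is not shown):

```python
import numpy as np

def triplet_loss(f, anchor, pos, neg, margin=1.0):
    """Push d(anchor, pos) below d(anchor, neg) by at least a margin."""
    d_pos = np.linalg.norm(f(anchor) - f(pos))
    d_neg = np.linalg.norm(f(anchor) - f(neg))
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5))        # a linear embedding stands in for the model
f = lambda x: W @ x                # the learned metric is distance in f-space
a, p, n = rng.normal(size=(3, 5))  # one triplet: anchor, similar, dissimilar
print("loss on one triplet:", triplet_loss(f, a, p, n))
```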
arXiv Detail & Related papers (2020-05-20T12:47:48Z)
- Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
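A minimal instance, assuming the simplest aggregate type (sums of regression targets over sets) and Gaussian noise, under which the maximum-likelihood fit reduces to least squares on aggregated features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5, 3))   # 200 bags of 5 instances with 3 features
w_true = np.array([1.0, -2.0, 0.5])
y_sum = (X @ w_true).sum(axis=1)   # only bag-level sums are observed

# Linearity gives sum_i <x_i, w> = <sum_i x_i, w>, so instance-level
# weights are recoverable from aggregated features alone.
Z = X.sum(axis=1)                  # (200, 3)
w_hat, *_ = np.linalg.lstsq(Z, y_sum, rcond=None)
print("recovered instance-level weights:", w_hat.round(3))
```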
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.