Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics
- URL: http://arxiv.org/abs/2405.07680v1
- Date: Mon, 13 May 2024 12:10:57 GMT
- Title: Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics
- Authors: Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
- Abstract summary: This paper presents a detailed review of eight evaluation metrics for human motion generation.
We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons.
We introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity.
- Score: 6.708543240320757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using a publicly available dataset, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code.
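To make the warping-diversity idea above concrete, the following is a minimal sketch, assuming a dynamic time warping (DTW) formulation: pairs of generated motion sequences are aligned with DTW, each alignment's deviation from the diagonal (identity) warping is measured, and the spread of those deviations across pairs summarizes how varied the temporal distortion is. The function names, the plain NumPy DTW, and the standard-deviation summary are illustrative assumptions rather than the paper's exact metric; see the authors' publicly accessible code for the actual definition.

```python
# Hypothetical sketch of a "warping diversity" measure over generated motions.
# Assumptions (not the paper's exact formulation): DTW alignment, deviation of
# the warping path from the diagonal, and the std of that deviation over pairs.
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Optimal DTW alignment path between two pose sequences of shape (T, D)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warping_deviation(path, n: int, m: int) -> float:
    """Mean absolute deviation of the warping path from the diagonal alignment.

    Sequences must have at least two frames so the normalization is defined.
    """
    return float(np.mean([abs(i / (n - 1) - j / (m - 1)) for i, j in path]))

def warping_diversity(motions) -> float:
    """Spread of temporal distortion across all pairs of generated motions."""
    devs = []
    for x in range(len(motions)):
        for y in range(x + 1, len(motions)):
            a, b = motions[x], motions[y]
            devs.append(warping_deviation(dtw_path(a, b), len(a), len(b)))
    return float(np.std(devs))  # higher value -> more diverse temporal distortion

# Toy usage: 10 generated motions, 60 frames, 22 joints x 3 coordinates each.
motions = [np.random.randn(60, 66) for _ in range(10)]
print(warping_diversity(motions))
```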
Related papers
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- On the Interpretability of Part-Prototype Based Classifiers: A Human Centric Analysis [4.465883551216819]
Part-prototype networks have recently become methods of interest as an interpretable alternative to many of the current black-box image classifiers.
We have devised a framework for evaluating the interpretability of part-prototype-based models from a human perspective.
arXiv Detail & Related papers (2023-10-10T19:32:59Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose Unified Multi-scenario Summarization Evaluation Model (UMSE)
Our UMSE is the first unified summarization evaluation framework that can be applied across three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
- AcME -- Accelerated Model-agnostic Explanations: Fast Whitening of the Machine-Learning Black Box [1.7534486934148554]
Interpretability approaches should provide actionable insights without making users wait.
We propose Accelerated Model-agnostic Explanations (AcME), an interpretability approach that quickly provides feature importance scores both at the global and the local level.
AcME computes feature ranking, but it also provides a what-if analysis tool to assess how changes in features values would affect model predictions.
arXiv Detail & Related papers (2021-12-23T15:18:13Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis.
We obtain new explanations that are loosely necessary and sufficient for a prediction.
We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent language model, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.