Learning and Evaluating Human Preferences for Conversational Head
Generation
- URL: http://arxiv.org/abs/2307.10636v2
- Date: Wed, 2 Aug 2023 04:08:47 GMT
- Title: Learning and Evaluating Human Preferences for Conversational Head
Generation
- Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei
- Abstract summary: We propose a novel learning-based evaluation metric named Preference Score (PS) that fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
- Score: 101.89332968344102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A reliable and comprehensive evaluation metric that aligns with manual
preference assessments is crucial for conversational head video synthesis
methods development. Existing quantitative evaluations often fail to capture
the full complexity of human preference, as they only consider limited
evaluation dimensions. Qualitative evaluations and user studies offer a
solution but are time-consuming and labor-intensive. This limitation hinders
the advancement of conversational head generation algorithms and systems. In
this paper, we propose a novel learning-based evaluation metric named
Preference Score (PS) that fits human preference based on quantitative
evaluations across different dimensions. PS can serve as a
quantitative evaluation without the need for human annotation. Experimental
results validate the superiority of Preference Score in aligning with human
perception, and also demonstrate robustness and generalizability to unseen
data, making it a valuable tool for advancing conversational head generation. We
expect this metric to facilitate new advances in conversational head generation. Project Page: https://github.com/dc3ea9f/PreferenceScore.
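The abstract describes PS as a learned model that maps multi-dimensional quantitative evaluations to a single preference-aligned score. The sketch below is a minimal illustration of that general idea only, assuming a small PyTorch regressor trained with a pairwise margin-ranking loss on human preference labels; the class and function names are hypothetical and this is not the authors' released implementation (see the project page for that).

```python
# Minimal sketch (not the authors' released code): fit a scalar preference
# score from per-dimension quantitative metrics using pairwise human
# preference labels. Names and the margin-ranking objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceScorer(nn.Module):
    """Map a vector of per-dimension metric values to one scalar score."""
    def __init__(self, num_dims: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_dims, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, metrics: torch.Tensor) -> torch.Tensor:
        return self.net(metrics).squeeze(-1)

def train_step(model, optimizer, metrics_a, metrics_b, prefer_a):
    """One pairwise update: push score(A) above score(B) when A is preferred."""
    optimizer.zero_grad()
    score_a, score_b = model(metrics_a), model(metrics_b)
    target = prefer_a.float() * 2 - 1        # +1 if A is preferred, -1 otherwise
    loss = F.margin_ranking_loss(score_a, score_b, target, margin=0.1)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: 5 metric dimensions, a batch of 8 human-annotated pairs.
model = PreferenceScorer(num_dims=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
metrics_a, metrics_b = torch.rand(8, 5), torch.rand(8, 5)
prefer_a = torch.randint(0, 2, (8,))
print(train_step(model, optimizer, metrics_a, metrics_b, prefer_a))
```

Once trained, such a scorer only needs the per-dimension metric values of a generated video to produce a score, which is consistent with the abstract's claim that PS can serve as a quantitative evaluation without further human annotation.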
Related papers
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking-head videos generated by four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers [0.685316573653194]
We survey human evaluation in papers presenting work on creative natural language generation.
The most common human evaluation method is a scaled survey, typically on a 5-point scale.
The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value.
arXiv Detail & Related papers (2021-07-31T18:54:30Z)
- Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- What comprises a good talking-head video generation?: A Survey and Benchmark [40.26689818789428]
We present a benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies.
We propose new metrics or select the most appropriate existing ones for evaluating results against what we consider the desired properties of a good talking-head video.
arXiv Detail & Related papers (2020-05-07T01:58:05Z)
- Designing Precise and Robust Dialogue Response Evaluators [35.137244385158034]
We propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained language models.
Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement (a toy correlation check is sketched after this list).
arXiv Detail & Related papers (2020-04-10T04:59:37Z)
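Several of the papers above report metric quality as correlation with human judgements. As a generic, hedged illustration (not code from any of the listed papers, and using made-up numbers), the snippet below computes Pearson and Spearman correlations between automatic metric scores and human ratings with SciPy.

```python
# Generic illustration: correlate automatic metric scores with human ratings.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.42, 0.57, 0.61, 0.33, 0.78, 0.52]  # hypothetical per-sample metric values
human_ratings = [2.0, 3.5, 4.0, 1.5, 4.5, 3.0]        # hypothetical human scores for the same samples

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

Spearman's rank correlation is often preferred when only the relative ordering of systems matters rather than the absolute score values.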
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.