Fugu-MT Paper Translation (Abstract): Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

Paper summary: Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

arxiv url: http://arxiv.org/abs/2511.01233v1
Date: Mon, 03 Nov 2025 05:17:28 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:27.124233
Title: Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
Title (reference translation): Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark
Authors: Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter
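Abstract summary: We review human evaluation practices in speech-driven 3D gesture generation. The paper introduces a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset.
Reference score (site-computed attention score): 55.41250396114216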
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
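Abstract (reference translation): We review human evaluation practices in automatic speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. As a result, it is impossible to know how different methods compare or what the state of the art is. To address common shortcomings in evaluation design and to standardise future user studies in gesture generation, we propose a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using this protocol, we run a large-scale crowdsourced evaluation that ranks six recent gesture-generation models (each trained by its original authors) along two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) to make progress, the field must assess motion quality and multimodal alignment separately for accurate benchmarking. Finally, to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models, over 750 rendered video stimuli from the user studies (enabling new evaluations without reimplementing the models), our open-source rendering script, and the 16,000 pairwise human preference votes collected for the benchmark.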
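
Note: the abstract describes ranking models from pairwise human preference votes, with motion realism and speech-gesture alignment assessed separately. The sketch below illustrates that general idea only. The vote format, the model names, and the use of a plain win rate are all assumptions made for illustration; the abstract does not state how the released votes are stored or which statistical analysis the authors apply.

```python
from collections import defaultdict


def win_rates(votes):
    """Return {model: fraction of its comparisons that it won}.

    `votes` is an iterable of (winner, loser) pairs. This tuple format is a
    hypothetical stand-in, not the format of the released benchmark data.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {model: wins[model] / total[model] for model in total}


def rank(votes):
    """Order models by descending win rate, breaking ties by name."""
    rates = win_rates(votes)
    return sorted(rates, key=lambda model: (-rates[model], model))


# Made-up votes, kept in separate lists so that each evaluation
# dimension is ranked on its own (a disentangled assessment).
realism_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_b"),
]
alignment_votes = [
    ("model_b", "model_a"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
]

realism_ranking = rank(realism_votes)
alignment_ranking = rank(alignment_votes)
print("realism:", realism_ranking)
print("alignment:", alignment_ranking)
```

In this toy data the two rankings differ, which is the kind of outcome a single combined score would hide. A raw win rate ignores which opponents each model faced, so a real analysis would likely need a more careful statistical model.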
