JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
- URL: http://arxiv.org/abs/2507.20987v2
- Date: Tue, 29 Jul 2025 04:13:25 GMT
- Title: JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
- Authors: Xinhan Di, Kristin Qi, Pengqian Yu
- Abstract summary: This paper introduces the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1). It comprises a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.
Related papers
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos.
Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios.
We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation [20.009670139005085]
Existing ultrasound segmentation methods often struggle with adaptability to new tasks.
We introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features.
These enriched features are then decoded to produce precise and robust segmentation.
arXiv Detail & Related papers (2025-03-31T17:47:42Z)
- MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX [15.038202110401336]
MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX) is a novel benchmark with 700 videos and 2,556 questions.
It is designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information.
Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels.
arXiv Detail & Related papers (2025-03-27T17:04:33Z)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores [18.26082503192707]
We develop a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization.
In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony.
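Fréchet-based metrics (the family behind FAD and FVD, which PEAVS is compared against) measure the distance between Gaussian statistics fitted to embedding features of reference and generated media. As a rough illustration only — the real metrics operate on multivariate deep embeddings, and the feature values below are invented — here is a minimal univariate sketch:

```python
import math

def frechet_gaussian_1d(mu1, var1, mu2, var2):
    """Frechet distance between two univariate Gaussians:
    d^2 = (mu1 - mu2)^2 + var1 + var2 - 2*sqrt(var1 * var2).
    (The multivariate form replaces the last term with a matrix square root.)"""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)

def feature_stats(xs):
    """Mean and population variance of a 1-D list of feature scores."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

# Toy example: hypothetical sync-feature scores from reference vs. generated clips.
mu_ref, var_ref = feature_stats([0.9, 1.0, 1.1, 1.0])
mu_gen, var_gen = feature_stats([0.4, 0.6, 0.5, 0.5])
print(frechet_gaussian_1d(mu_ref, var_ref, mu_gen, var_gen))
```

Identical distributions score 0; the further apart the feature statistics drift, the larger the distance — which is exactly the notion of "extension to audio-visual synchrony" that PEAVS is benchmarked against.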
arXiv Detail & Related papers (2024-04-10T20:32:24Z)
- VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
- AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos.
For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset.
For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Towards Realistic Visual Dubbing with Heterogeneous Sources [22.250010330418398]
Few-shot visual dubbing involves synchronizing the lip movements with arbitrary speech input for any talking head.
We propose a simple yet efficient two-stage framework with a higher flexibility of mining heterogeneous data.
Our method makes it possible to independently utilize the training corpus for two-stage sub-networks.
arXiv Detail & Related papers (2022-01-17T07:57:24Z)
- UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.