VABench: A Comprehensive Benchmark for Audio-Video Generation
- URL: http://arxiv.org/abs/2512.09299v1
- Date: Wed, 10 Dec 2025 03:57:29 GMT
- Title: VABench: A Comprehensive Benchmark for Audio-Video Generation
- Authors: Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang,
- Abstract summary: VABench is a benchmark framework designed to evaluate the capabilities of synchronous audio-video generation. It covers three task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds.
- Score: 22.00633729850902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
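The abstract's pairwise-similarity dimensions (text-video, text-audio, video-audio) can be illustrated with a minimal sketch. This is not VABench's actual implementation; it assumes each modality has already been mapped into a shared embedding space by hypothetical CLIP/CLAP-style encoders, and simply scores each pair by cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_av_scores(text_emb, video_emb, audio_emb):
    """Score one generated sample on the three pairwise-similarity axes."""
    return {
        "text-video": cosine_similarity(text_emb, video_emb),
        "text-audio": cosine_similarity(text_emb, audio_emb),
        "video-audio": cosine_similarity(video_emb, audio_emb),
    }

# Toy vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
t, v, a = rng.normal(size=(3, 512))
scores = pairwise_av_scores(t, v, a)
```

In a real benchmark the three embeddings would come from separate pretrained encoders, and per-sample scores would be averaged over the test set per content category.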
Related papers
- MOVA: Towards Scalable and Synchronized Video-Audio Generation [91.56945636522345]
We introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators.
arXiv Detail & Related papers (2026-02-09T15:31:54Z)
- ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch. ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z)
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
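The conditional flow matching objective mentioned here has a standard generic form, which can be sketched independently of ViSAudio's architecture. The sketch below assumes a straight-line (rectified-flow) path from noise to data and a hypothetical `dummy_model` in place of the paper's dual-branch network; the model is trained to regress the constant velocity of that path:

```python
import numpy as np

def cfm_loss(x1, cond, model, rng):
    """One conditional flow-matching training step (generic sketch).

    x1:   a batch of target samples (e.g. audio latents)
    cond: conditioning features (e.g. video-derived), passed to the model
    """
    x0 = rng.normal(size=x1.shape)        # noise endpoint of the path
    t = rng.uniform()                     # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1          # point on the straight path
    target_v = x1 - x0                    # constant velocity of that path
    pred_v = model(xt, t, cond)           # model's velocity prediction
    return float(np.mean((pred_v - target_v) ** 2))

# Placeholder velocity predictor standing in for a trained network.
def dummy_model(xt, t, cond):
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))              # toy batch of target latents
cond = rng.normal(size=(4, 16))           # toy conditioning (unused by dummy)
loss = cfm_loss(x1, cond, dummy_model, rng)
```

At sampling time, the learned velocity field would be integrated from noise to data with an ODE solver, conditioned on the video features.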
arXiv Detail & Related papers (2025-12-02T18:56:12Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent generation, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques. We demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs [55.82090875098132]
VideoLLaMA 2 is a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks.
VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks.
arXiv Detail & Related papers (2024-06-11T17:22:23Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
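The core synchronisation task described above can be sketched as an offset search: given per-frame video and audio features from some hypothetical encoders (the paper uses transformer-based models; the names below are illustrative only), slide one stream against the other and pick the shift that maximises mean frame-wise cosine similarity:

```python
import numpy as np

def best_av_offset(video_feats, audio_feats, max_shift=5):
    """Return (shift, score): the audio offset in frames that maximises
    mean frame-wise cosine similarity between the two feature sequences."""
    def mean_cos(v, a):
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        return float(np.mean(np.sum(v * a, axis=1)))

    best = None
    for shift in range(-max_shift, max_shift + 1):
        # Align video frame (k + shift) with audio frame k.
        if shift >= 0:
            v, a = video_feats[shift:], audio_feats[:len(audio_feats) - shift]
        else:
            v, a = video_feats[:shift], audio_feats[-shift:]
        n = min(len(v), len(a))
        score = mean_cos(v[:n], a[:n])
        if best is None or score > best[1]:
            best = (shift, score)
    return best

# Toy example: audio delayed by two frames relative to video.
rng = np.random.default_rng(1)
feats = rng.normal(size=(30, 16))
video = feats
audio = np.vstack([rng.normal(size=(2, 16)), feats[:-2]])
shift, score = best_av_offset(video, audio, max_shift=5)
```

Real systems learn the feature extractors so that in-sync pairs score high and off-sync pairs score low; the sliding-window decision rule itself stays this simple.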
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder [0.0]
We propose NAViDAd, a no-reference audio-visual quality metric based on a deep autoencoder.
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
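The two-layer structure described above (an unsupervised autoencoder layer feeding a classification layer) can be sketched in miniature. This is not NAViDAd's actual architecture: as a stand-in for the deep autoencoder, the sketch uses a linear autoencoder (equivalent to PCA via SVD), and as a stand-in for the classification layer, a nearest-centroid classifier over the latent codes:

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Linear autoencoder via SVD (PCA): encode projects onto the top-k
    components, decode projects back to the input space."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T                                   # d x k weight matrix
    encode = lambda Z: (Z - mu) @ W
    decode = lambda H: H @ W.T + mu
    return encode, decode

def nearest_centroid_classifier(H, y):
    """Classification layer: label a latent code by its nearest class centroid."""
    classes = np.unique(y)
    centroids = np.stack([H[y == c].mean(axis=0) for c in classes])
    def predict(Hq):
        d = np.linalg.norm(Hq[:, None, :] - centroids[None], axis=2)
        return classes[np.argmin(d, axis=1)]
    return predict

# Toy data: two well-separated quality classes in a 10-D feature space.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(3.0, 1.0, size=(50, 10)),
               rng.normal(-3.0, 1.0, size=(50, 10))])
y = np.array([0] * 50 + [1] * 50)
encode, decode = fit_linear_autoencoder(X, k=2)
predict = nearest_centroid_classifier(encode(X), y)
accuracy = float(np.mean(predict(encode(X)) == y))
```

The design point the paper's framework illustrates is that the autoencoder layer is trained without quality labels, so only the small classification layer needs annotated data.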
arXiv Detail & Related papers (2020-01-30T15:40:08Z)