Related papers: PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

URL: http://arxiv.org/abs/2511.18833v3
Date: Thu, 27 Nov 2025 07:35:14 GMT
Title: PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
Authors: Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue
Abstract summary: Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning.
Score: 57.864929968616586
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.

Related papers

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance merely with around 1M public training entries.
arXiv Detail & Related papers (2026-02-22T12:44:28Z)
Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A) and joint video-text-to-audio (VT2A) generation offers significant application flexibility. SoundAtlas is a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z)
AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch [0.3728263002609659]
This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT). AMAuT eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8%.
arXiv Detail & Related papers (2025-10-22T08:41:59Z)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization [10.717164013707693]
Current video-to-audio (V2A) methods struggle in complex multi-event scenarios. This study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain.
arXiv Detail & Related papers (2025-09-24T11:04:34Z)
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment [76.32508013503653]
We propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We tackle the mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. We improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens.
arXiv Detail & Related papers (2025-05-02T12:59:58Z)
DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time. We demonstrate a surprisingly wide range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
