MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
- URL: http://arxiv.org/abs/2405.17842v2
- Date: Tue, 25 Feb 2025 09:34:26 GMT
- Title: MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
- Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
- Abstract summary: This study aims to construct an audio-video generative model with minimal computational cost. We propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters.
- Score: 15.29891397291197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models to match the score of joint distribution over audio and video. We show that this guidance can be computed using the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones independently generated by the base models. Based on this analysis, we construct a joint guidance module by training this discriminator. Additionally, we adopt a loss function to stabilize the discriminator's gradient and make it work as a noise estimator, as in standard diffusion models. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters. The code is available at: https://github.com/SonyResearch/MMDisCo.
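The guidance described in the abstract is straightforward to express in code. Below is a minimal sketch, assuming hypothetical callables `audio_model(x, t)` and `video_model(x, t)` that return noise estimates, and a trained discriminator `disc(xa, xv, t)` that returns a logit separating real audio-video pairs from independently generated ones; it illustrates the technique, not the authors' implementation.

```python
import torch

def guided_noise_estimates(audio_model, video_model, disc, xa, xv, t, scale=1.0):
    """One denoising step's noise estimates, adjusted toward the joint distribution."""
    eps_a = audio_model(xa, t)   # per-modality estimates from the frozen base models
    eps_v = video_model(xv, t)
    xa = xa.detach().requires_grad_(True)
    xv = xv.detach().requires_grad_(True)
    # The optimal discriminator's logit approximates
    # log p_joint(xa, xv) - log p_a(xa) p_v(xv); its gradient nudges the two
    # chains toward audio-video pairs that are plausible together.
    logit = disc(xa, xv, t).sum()
    grad_a, grad_v = torch.autograd.grad(logit, (xa, xv))
    # Subtracting the gradient from the noise estimate adds it to the score
    # (noise estimate and score differ by a factor of -sigma_t, omitted here).
    return eps_a - scale * grad_a, eps_v - scale * grad_v
```

Each reverse step then plugs the adjusted estimates into the usual single-modal update rules, so the base models never need fine-tuning.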
Related papers
- Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts [64.34482582690927]
We provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models.
We propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality.
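For readers unfamiliar with SMC, the resampling step it relies on looks roughly like the generic low-variance (systematic) resampler below; the paper's Feynman-Kac correctors define the actual weights, which are assumed given here.

```python
import numpy as np

def systematic_resample(particles, log_weights, rng=None):
    """particles: (n, ...) array; log_weights: (n,) unnormalized log weights."""
    rng = rng or np.random.default_rng()
    n = len(particles)
    w = np.exp(log_weights - np.max(log_weights))
    w /= w.sum()
    # One shared uniform offset plus evenly spaced pointers: lower variance
    # than n independent multinomial draws.
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(w), positions)
    return particles[idx]
```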
arXiv Detail & Related papers (2025-03-04T17:46:51Z)
- Generative Lines Matching Models [2.6089354079273512]
We propose the Lines Matching Model (LMM), a new probability flow model that matches globally straight lines interpolating the two distributions.
We demonstrate that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores.
Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets.
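As a rough illustration of what matching globally straight lines can mean in practice, here is a generic conditional flow-matching loss with linear interpolants and a constant target velocity; the model signature `model(xt, t)` is assumed, and the paper's exact objective may differ.

```python
import torch

def line_matching_loss(model, x0, x1):
    """x0 ~ source (e.g. Gaussian noise), x1 ~ data, both of shape (B, ...)."""
    # Sample one time per example, broadcastable over the data dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1 - t) * x0 + t * x1       # globally straight interpolation
    v_target = x1 - x0               # constant velocity along the line
    return ((model(xt, t.reshape(-1)) - v_target) ** 2).mean()
```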
arXiv Detail & Related papers (2024-12-09T11:33:38Z)
- On Sampling Strategies for Spectral Model Sharding [7.185534285278903]
In this work, we present two sampling strategies for such sharding.
The first produces unbiased estimators of the original weights, while the second aims to minimize the squared approximation error.
We demonstrate that both of these methods can lead to improved performance on various commonly used datasets.
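One concrete way to obtain such an unbiased estimator is importance sampling over the SVD terms of a weight matrix, sketched below with sampling probabilities proportional to the singular values (the paper studies the choice of probabilities; this one is purely illustrative).

```python
import torch

def sampled_spectral_shard(W, k):
    """Unbiased rank-<=k estimate of W from k sampled SVD terms."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    p = S / S.sum()                                # sampling distribution over terms
    idx = torch.multinomial(p, k, replacement=True)
    # Reweighting each drawn term s_i u_i v_i^T by 1 / (k * p_i) gives
    # E[W_hat] = sum_i p_i * s_i u_i v_i^T / p_i = W.
    coeff = S[idx] / (k * p[idx])
    return (U[:, idx] * coeff) @ Vh[idx, :]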
arXiv Detail & Related papers (2024-10-31T16:37:25Z)
- A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation [15.29891397291197]
Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it so that the combined model jointly generates audio and video.
To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model.
arXiv Detail & Related papers (2024-09-26T05:39:52Z)
- CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio.
We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
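A standard form such a synchronization loss can take is symmetric InfoNCE over paired clip embeddings, sketched below; CMMD's actual loss may differ in its details.

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(za, zv, tau=0.07):
    """za, zv: (B, D) embeddings of temporally aligned audio/video clips."""
    za, zv = F.normalize(za, dim=-1), F.normalize(zv, dim=-1)
    logits = za @ zv.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(za.shape[0], device=za.device)
    # Matched pairs sit on the diagonal; symmetrize over both directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```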
arXiv Detail & Related papers (2023-12-08T23:55:19Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
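For reference, the unmodified focal loss looks as follows; the MECCANO-specific tailoring mentioned above is not reproduced here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """gamma=2 is the usual default; gamma=0 recovers plain cross-entropy."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # model's probability for the true class
    return ((1 - pt) ** gamma * ce).mean()   # down-weight easy, well-classified examples
```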
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- Reflected Diffusion Models [93.26107023470979]
We present Reflected Diffusion Models, which reverse a reflected stochastic differential equation evolving on the support of the data.
Our approach learns the score function through a generalized score matching loss and extends key components of standard diffusion models.
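To give intuition, a toy reflection operator for data constrained to the unit cube is shown below: any coordinate leaving [0, 1] after an update is folded back across the boundary. This only illustrates the constraint mechanism, not the paper's generalized score matching.

```python
import torch

def reflect_unit_cube(x):
    t = torch.remainder(x, 2.0)                 # fold the real line onto [0, 2)
    return torch.where(t > 1.0, 2.0 - t, t)     # mirror [1, 2) back onto [0, 1]
```

A sampler would apply it after every update, e.g. `x = reflect_unit_cube(x + drift + noise)`.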
arXiv Detail & Related papers (2023-04-10T17:54:38Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the known upper bound on source separation performance can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion is built around a sequential multi-modal U-Net that performs the joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
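The mechanism can be pictured as a small gating network scoring each source model per target sample; the sketch below uses illustrative names and a linear gate, not SESoM's actual architecture.

```python
import torch
import torch.nn as nn

class SampleSpecificEnsemble(nn.Module):
    def __init__(self, d_model, n_sources):
        super().__init__()
        self.gate = nn.Linear(d_model, n_sources)   # one score per source model

    def forward(self, x_repr, source_logits):
        """x_repr: (B, D) target-sample features; source_logits: (B, S, C)."""
        w = torch.softmax(self.gate(x_repr), dim=-1)         # (B, S) per-sample weights
        return (w.unsqueeze(-1) * source_logits).sum(dim=1)  # (B, C) mixed prediction
```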
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- Unsupervised Audio Source Separation Using Differentiable Parametric Source Models [8.80867379881193]
We propose an unsupervised model-based deep learning approach to musical source separation.
A neural network is trained to reconstruct the observed mixture as a sum of the sources.
The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods.
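The core unsupervised objective implied by this setup is simply that the rendered sources must add up to the observed mixture; a schematic version (not the paper's code) is:

```python
import torch

def mixture_reconstruction_loss(mixture, estimated_sources):
    """mixture: (B, T) waveform; estimated_sources: (B, n_sources, T)."""
    return ((mixture - estimated_sources.sum(dim=1)) ** 2).mean()
```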
arXiv Detail & Related papers (2022-01-24T11:05:30Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
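The fusion-bottleneck idea restricts cross-modal attention to a handful of shared tokens. A schematic layer is sketched below; the update order and sharing scheme are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, d, n_heads=8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, a, v, b):
        """a: (B, Ta, d) audio tokens, v: (B, Tv, d) video tokens,
        b: (B, nb, d) shared bottleneck tokens (the only cross-modal channel)."""
        nb = b.shape[1]
        za = torch.cat([a, b], dim=1)
        out_a, _ = self.attn_a(za, za, za)
        a, b = out_a[:, :-nb], out_a[:, -nb:]   # audio updates the bottleneck...
        zv = torch.cat([v, b], dim=1)
        out_v, _ = self.attn_v(zv, zv, zv)      # ...then video reads and updates it
        return a, out_v[:, :-nb], out_v[:, -nb:]
```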
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal given the video.
We demonstrate the performance of our model on the GRID dataset using standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- Unified Gradient Reweighting for Model Biasing with Applications to Source Separation [27.215800308343322]
We propose a simple, unified gradient reweighting scheme to bias the learning process of a model and steer it towards a certain distribution of results.
We apply this method to various source separation tasks, in order to shift the operating point of the models towards different objectives.
Our framework enables the user to control a robustness trade-off between worst and average performance.
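One simple instantiation of such a reweighting, assumed here purely for illustration, scales each sample's loss by a softmax over the batch losses, shifting gradient mass toward the worst-performing samples; `alpha` trades off worst-case versus average emphasis.

```python
import torch

def biased_batch_loss(per_sample_loss, alpha=2.0):
    """per_sample_loss: (B,). alpha=0 recovers the plain mean loss."""
    with torch.no_grad():
        # Weights sum to B, so the overall scale matches an unweighted mean.
        w = torch.softmax(alpha * per_sample_loss, dim=0) * per_sample_loss.numel()
    return (w * per_sample_loss).mean()
```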
arXiv Detail & Related papers (2020-10-25T21:41:45Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.