$^R$FLAV: Rolling Flow matching for infinite Audio Video generation
- URL: http://arxiv.org/abs/2503.08307v2
- Date: Wed, 12 Mar 2025 08:48:33 GMT
- Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation
- Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
- Abstract summary: Joint audio-video (AV) generation is still a significant challenge in generative AI. We present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks.
- Score: 5.7858802690354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
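As a rough illustration of the flow-matching objective named in the title, the sketch below shows one training step over joint audio-video latents with a lightweight cross-modal fusion layer. The module names, tensor shapes, and fusion design are illustrative assumptions based only on the abstract, not the authors' implementation; the linked repository contains the actual model.

```python
# Minimal sketch of a flow-matching training step for joint audio-video latents.
# Shapes, module names, and the fusion design are assumptions for illustration only.
import torch
import torch.nn as nn

class LightweightTemporalFusion(nn.Module):
    """Exchanges per-frame information between the audio and video token streams."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_video = nn.Linear(2 * dim, dim)
        self.to_audio = nn.Linear(2 * dim, dim)

    def forward(self, v, a):
        # v, a: (batch, frames, dim) -- one token per frame and per modality.
        joint = torch.cat([v, a], dim=-1)
        return v + self.to_video(joint), a + self.to_audio(joint)

class JointVelocityNet(nn.Module):
    """Predicts the flow-matching velocity for both modalities at once."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fusion = LightweightTemporalFusion(dim)
        self.video_head = nn.Linear(dim, dim)
        self.audio_head = nn.Linear(dim, dim)

    def forward(self, v_t, a_t, t):
        # A real velocity network would also embed t; omitted here for brevity.
        v, a = self.fusion(v_t, a_t)
        return self.video_head(v), self.audio_head(a)

def flow_matching_step(model, v1, a1):
    """One training step: regress the straight-line velocity x1 - x0 at a random t."""
    v0, a0 = torch.randn_like(v1), torch.randn_like(a1)   # noise endpoints at t = 0
    t = torch.rand(v1.size(0), 1, 1)                      # per-sample time in [0, 1)
    v_t = (1 - t) * v0 + t * v1                           # linear interpolation path
    a_t = (1 - t) * a0 + t * a1
    pred_v, pred_a = model(v_t, a_t, t)
    return ((pred_v - (v1 - v0)) ** 2).mean() + ((pred_a - (a1 - a0)) ** 2).mean()

model = JointVelocityNet()
video_latents = torch.randn(2, 16, 256)   # (batch, frames, dim), e.g. from a video VAE
audio_latents = torch.randn(2, 16, 256)   # per-frame audio latents, same layout
print(flow_matching_step(model, video_latents, audio_latents))
```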
Related papers
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks [21.710127132217526]
We introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks.
VINs encode global semantics from the noisy inputs of local chunks, and these encoded representations, in turn, guide the DiTs in denoising the chunks in parallel.
Our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation.
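Based only on the two sentences above, the following hedged sketch shows how an abstraction module could summarize all noisy chunks into a few global tokens that each chunk then cross-attends to while being denoised in parallel. All names and shapes are assumptions, not the VIN implementation.

```python
# Hedged sketch: a small abstraction module summarizes all noisy chunks into global
# tokens, and each chunk is denoised in parallel while cross-attending to that summary.
import torch
import torch.nn as nn

class GlobalAbstraction(nn.Module):
    def __init__(self, dim: int, num_global: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_global, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, chunk_tokens):
        # chunk_tokens: (batch, num_chunks * tokens_per_chunk, dim)
        q = self.queries.unsqueeze(0).expand(chunk_tokens.size(0), -1, -1)
        global_tokens, _ = self.attn(q, chunk_tokens, chunk_tokens)
        return global_tokens                        # (batch, num_global, dim)

class ChunkDenoiser(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, chunk, global_tokens):
        # chunk: (batch * num_chunks, tokens_per_chunk, dim) -- processed in parallel.
        ctx, _ = self.cross(chunk, global_tokens, global_tokens)
        return chunk + self.ff(chunk + ctx)

dim, B, C, T = 64, 2, 4, 16                         # dim, batch, chunks, tokens/chunk
chunks = torch.randn(B, C, T, dim)
g = GlobalAbstraction(dim)(chunks.flatten(1, 2))    # summarize every chunk jointly
g = g.repeat_interleave(C, dim=0)                   # share the summary with each chunk
out = ChunkDenoiser(dim)(chunks.flatten(0, 1), g)   # denoise chunks in parallel
print(out.shape)                                    # torch.Size([8, 16, 64])
```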
arXiv Detail & Related papers (2025-03-21T21:13:02Z)
- UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation [44.21422404659117]
UniForm is a unified multi-task diffusion transformer that jointly generates audio and visual modalities in a shared latent space.
A single diffusion process models both audio and video, capturing the inherent correlations between sound and vision.
By leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches.
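A minimal sketch of the shared-latent-space idea described above: audio and video latents are tagged with a modality embedding, concatenated into one sequence, and processed by a single backbone. Names and sizes are illustrative assumptions about UniForm, not its actual code.

```python
# Hedged sketch of joint processing in a shared latent space with one transformer.
import torch
import torch.nn as nn

class SharedDenoiser(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.modality = nn.Embedding(2, dim)                     # 0 = video, 1 = audio
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_tokens, audio_tokens):
        v = video_tokens + self.modality(torch.zeros(1, dtype=torch.long))
        a = audio_tokens + self.modality(torch.ones(1, dtype=torch.long))
        x = torch.cat([v, a], dim=1)                             # one joint sequence
        x = self.backbone(x)                                     # one joint pass
        return x[:, :v.size(1)], x[:, v.size(1):]                # split back per modality

v_out, a_out = SharedDenoiser()(torch.randn(2, 16, 128), torch.randn(2, 8, 128))
print(v_out.shape, a_out.shape)   # torch.Size([2, 16, 128]) torch.Size([2, 8, 128])
```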
arXiv Detail & Related papers (2025-02-06T09:18:30Z)
- AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task.
Our framework incorporates two key components for video understanding and cross-modal learning.
Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z)
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
We propose a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation.
The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models.
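A hedged sketch of what such a bidirectional fusion block could look like: features from the video model attend to audio features and vice versa, with residual connections in both directions. Layer names and sizes are assumptions; the paper defines the actual Fusion Block.

```python
# Hedged sketch of bidirectional cross-attention between video and audio features.
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_feat, audio_feat):
        # video_feat: (batch, video_tokens, dim); audio_feat: (batch, audio_tokens, dim)
        v_ctx, _ = self.v_from_a(self.norm_v(video_feat), audio_feat, audio_feat)
        a_ctx, _ = self.a_from_v(self.norm_a(audio_feat), video_feat, video_feat)
        return video_feat + v_ctx, audio_feat + a_ctx    # residual exchange both ways

fusion = BidirectionalFusion()
v, a = fusion(torch.randn(1, 32, 128), torch.randn(1, 50, 128))
print(v.shape, a.shape)
```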
arXiv Detail & Related papers (2024-12-19T18:57:21Z)
- OMCAT: Omni Context Aware Transformer [27.674943980306423]
OCTAV is a novel dataset designed to capture event transitions across audio and video.
OMCAT is a powerful model that leverages RoTE to enhance temporal grounding and computational efficiency in time-anchored tasks.
Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment.
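Reading RoTE as rotary embeddings indexed by absolute timestamps rather than token positions, the sketch below rotates feature pairs by angles proportional to each token's time in seconds. This is an assumption drawn from the one-line summary; the actual OMCAT formulation may differ.

```python
# Hedged sketch of rotary embeddings driven by absolute timestamps (in seconds).
import torch

def rotary_time_embedding(x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x by angles proportional to each token's timestamp.

    x: (batch, tokens, dim) with even dim; timestamps: (batch, tokens) in seconds.
    """
    dim = x.size(-1)
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = timestamps.unsqueeze(-1) * freqs          # (batch, tokens, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                # split into rotation pairs
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                         # interleave the pairs back

tokens = torch.randn(1, 4, 8)
seconds = torch.tensor([[0.0, 0.5, 1.0, 7.25]])        # real event times, not indices
print(rotary_time_embedding(tokens, seconds).shape)    # torch.Size([1, 4, 8])
```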
arXiv Detail & Related papers (2024-10-15T23:16:28Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
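Rectified flow models transport noise to data along nearly straight paths, which makes coarse Euler integration with only a few steps viable. Below is a hedged sketch of such few-step sampling for video-conditioned audio; the velocity-network interface is an assumption, not Frieren's actual API.

```python
# Hedged sketch of few-step rectified-flow sampling for video-to-audio generation.
import torch

@torch.no_grad()
def sample_audio(velocity_net, video_feats, steps: int = 4, latent_shape=(1, 64, 128)):
    """Integrate da/dt = v(a, t, video) from t=0 (noise) to t=1 (audio latent)."""
    a = torch.randn(latent_shape)                       # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        a = a + dt * velocity_net(a, t, video_feats)    # one Euler step along the flow
    return a

# Toy stand-in for a trained conditional velocity network.
fake_net = lambda a, t, v: -a
audio_latent = sample_audio(fake_net, video_feats=None)
print(audio_latent.shape)                               # torch.Size([1, 64, 128])
```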
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
We show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
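The masking idea can be made concrete with an attention mask over a joint [video | audio | fusion] token sequence: unimodal tokens attend only within their own modality, while fusion tokens may attend to everything. The exact routing used by Zorro may differ; sizes here are illustrative.

```python
# Hedged sketch of a modality-routing attention mask for a joint token sequence.
import torch

def modality_routing_mask(n_video: int, n_audio: int, n_fusion: int) -> torch.Tensor:
    """Boolean mask (True = blocked) for a joint [video | audio | fusion] sequence."""
    n = n_video + n_audio + n_fusion
    mask = torch.ones(n, n, dtype=torch.bool)              # block everything first
    v = slice(0, n_video)
    a = slice(n_video, n_video + n_audio)
    f = slice(n_video + n_audio, n)
    mask[v, v] = False                                     # video attends to video only
    mask[a, a] = False                                     # audio attends to audio only
    mask[f, :] = False                                     # fusion attends to all tokens
    # Can be passed as `attn_mask` to torch.nn.MultiheadAttention
    # (True marks positions that are not allowed to attend).
    return mask

mask = modality_routing_mask(n_video=4, n_audio=3, n_fusion=2)
print(mask.int())   # rows = queries, columns = keys; 1 means the link is masked out
```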
arXiv Detail & Related papers (2023-01-23T17:51:39Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously.
MM-Diffusion is designed around a sequential multi-modal U-Net that performs a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.