Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
- URL: http://arxiv.org/abs/2510.21581v1
- Date: Fri, 24 Oct 2025 15:49:54 GMT
- Title: Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
- Authors: Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, Zach Evans,
- Abstract summary: Foley Control is a lightweight approach to video-guided Foley. It keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities.
- Score: 39.74394488889939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
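The abstract pins down the bridge's placement precisely enough to sketch. Below is a minimal PyTorch sketch, assuming illustrative module names (`VideoCrossAttentionBridge`, `pool_video_tokens`), dimensions, and an average-pooling scheme; only the residual video cross-attention inserted after the frozen text cross-attention, and the pooling of video tokens before conditioning, come from the abstract.

```python
# Minimal sketch of the described bridge. Module names, dimensions, and the
# average-pooling scheme are illustrative assumptions; only the placement
# (video cross-attention after the frozen text cross-attention, added
# residually) follows the abstract.
import torch
import torch.nn as nn

def pool_video_tokens(video_tokens: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Temporal average pooling over video tokens to cut memory before
    conditioning (the paper's exact pooling scheme is not specified here)."""
    b, t, d = video_tokens.shape
    t_out = t // factor
    return video_tokens[:, : t_out * factor].reshape(b, t_out, factor, d).mean(dim=2)

class VideoCrossAttentionBridge(nn.Module):
    """Compact trainable cross-attention from audio latents (queries) to
    video embeddings (keys/values); both backbones stay frozen."""
    def __init__(self, audio_dim: int, video_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(audio_dim)
        self.attn = nn.MultiheadAttention(
            audio_dim, n_heads, kdim=video_dim, vdim=video_dim, batch_first=True
        )

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (B, T_audio, audio_dim) latent sequence inside the DiT
        # video_tokens: (B, T_video, video_dim), e.g. pooled V-JEPA2 embeddings
        out, _ = self.attn(self.norm(audio_tokens), video_tokens, video_tokens,
                           need_weights=False)
        return audio_tokens + out  # residual keeps the frozen T2A path intact

# Conceptual placement inside each frozen DiT block:
#   x = x + self_attention(x)                     # frozen
#   x = x + text_cross_attention(x, text_tokens)  # frozen: prompt sets semantics
#   x = bridge(x, video_tokens)                   # trained: video refines timing
#   x = x + mlp(x)                                # frozen
```

Because the bridge is residual and both backbones stay frozen, the pretrained T2A behavior is preserved, which is what makes the advertised modularity (swapping encoders or the T2A backbone without end-to-end retraining) plausible.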
Related papers
- LTX-2: Efficient Joint Audio-Visual Foundation Model [3.1804093402153506]
LTX-2 is an open-source model capable of generating temporally synchronized audiovisual content. We employ a multilingual text encoder for broader prompt understanding. LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene.
arXiv Detail & Related papers (2026-01-06T18:24:41Z) - Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation [20.446421146630474]
We introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony.
arXiv Detail & Related papers (2025-12-02T06:31:38Z) - Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation [5.304004483404346]
Ovi is a unified paradigm for audio-video generation that models the two modalities as a single generative process. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips.
arXiv Detail & Related papers (2025-09-30T21:03:50Z) - AudioStory: Generating Long-Form Narrative Audio with Large Language Models [87.23256929520743]
AudioStory is a framework that integrates large language models with text-to-audio systems to generate structured, long-form audio narratives. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
arXiv Detail & Related papers (2025-08-27T17:55:38Z) - MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z) - AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
We propose a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models (a hedged fusion-block sketch appears after this list).
arXiv Detail & Related papers (2024-12-19T18:57:21Z) - Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [19.694770666874827]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts (a minimal RMS-envelope sketch appears after this list).
arXiv Detail & Related papers (2024-08-21T18:06:15Z) - FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z) - STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
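Two of the entries above describe mechanisms concretely enough to sketch. First, a hedged sketch of a bidirectional fusion block in the spirit of AV-Link's Fusion Block: each stream cross-attends to the other's diffusion features. The class name, shared feature dimension, and use of plain multi-head cross-attention are assumptions, not the paper's verified design.

```python
# Hedged sketch of a bidirectional audio-video fusion block; names,
# dimensions, and the plain cross-attention layers are assumptions.
import torch
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor):
        # audio_feats: (B, T_audio, dim); video_feats: (B, T_video, dim)
        a_out, _ = self.v2a(self.norm_a(audio_feats), video_feats, video_feats)
        v_out, _ = self.a2v(self.norm_v(video_feats), audio_feats, audio_feats)
        # residual updates so each stream keeps its own denoising trajectory
        return audio_feats + a_out, video_feats + v_out
```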
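Second, the RMS condition from the Video-Foley entry: a frame-wise loudness envelope is simple to compute from a reference waveform. The `frame_len` and `hop` defaults below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of a frame-wise RMS (root-mean-square) loudness envelope of
# the kind Video-Foley uses as a temporal condition; frame_len and hop are
# illustrative assumptions.
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise RMS of a mono waveform (assumes len(audio) >= frame_len)."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, frame_len)[::hop]
    return np.sqrt((frames ** 2).mean(axis=1))
```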