Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
- URL: http://arxiv.org/abs/2510.21581v1
- Date: Fri, 24 Oct 2025 15:49:54 GMT
- Title: Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
- Authors: Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, Zach Evans,
- Abstract summary: Foley Control is a lightweight approach to video-guided Foley. It keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities.
- Score: 39.74394488889939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
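The abstract pins down the bridge's placement precisely enough to sketch. Below is a minimal PyTorch sketch, assuming illustrative module names (`VideoCrossAttentionBridge`, `pool_video_tokens`), dimensions, and an average-pooling scheme; only the residual video cross-attention inserted after the frozen text cross-attention, and the pooling of video tokens before conditioning, come from the abstract.

```python
# Minimal sketch of the described bridge. Module names, dimensions, and the
# average-pooling scheme are illustrative assumptions; only the placement
# (video cross-attention after the frozen text cross-attention, added
# residually) follows the abstract.
import torch
import torch.nn as nn

def pool_video_tokens(video_tokens: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Temporal average pooling over video tokens to cut memory before
    conditioning (the paper's exact pooling scheme is not specified here)."""
    b, t, d = video_tokens.shape
    t_out = t // factor
    return video_tokens[:, : t_out * factor].reshape(b, t_out, factor, d).mean(dim=2)

class VideoCrossAttentionBridge(nn.Module):
    """Compact trainable cross-attention from audio latents (queries) to
    video embeddings (keys/values); both backbones stay frozen."""
    def __init__(self, audio_dim: int, video_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(audio_dim)
        self.attn = nn.MultiheadAttention(
            audio_dim, n_heads, kdim=video_dim, vdim=video_dim, batch_first=True
        )

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (B, T_audio, audio_dim) latent sequence inside the DiT
        # video_tokens: (B, T_video, video_dim), e.g. pooled V-JEPA2 embeddings
        out, _ = self.attn(self.norm(audio_tokens), video_tokens, video_tokens,
                           need_weights=False)
        return audio_tokens + out  # residual keeps the frozen T2A path intact

# Conceptual placement inside each frozen DiT block:
#   x = x + self_attention(x)                     # frozen
#   x = x + text_cross_attention(x, text_tokens)  # frozen: prompt sets semantics
#   x = bridge(x, video_tokens)                   # trained: video refines timing
#   x = x + mlp(x)                                # frozen
```

Because the bridge is residual and both backbones stay frozen, the pretrained T2A behavior is preserved, which is what makes the advertised modularity (swapping encoders or the T2A backbone without end-to-end retraining) plausible.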
Related papers
- LTX-2: Efficient Joint Audio-Visual Foundation Model [3.1804093402153506]
LTX-2 is an open-source model capable of generating temporally synchronized audiovisual content. We employ a multilingual text encoder for broader prompt understanding. LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene.
arXiv Detail & Related papers (2026-01-06T18:24:41Z) - Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation [20.446421146630474]
We introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony.
arXiv Detail & Related papers (2025-12-02T06:31:38Z) - Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation [5.304004483404346]
Ovi is a unified paradigm for audio-video generation that models the two modalities as a single generative process. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips.
arXiv Detail & Related papers (2025-09-30T21:03:50Z) - AudioStory: Generating Long-Form Narrative Audio with Large Language Models [87.23256929520743]
AudioStory is a framework that integrates large language models with text-to-audio systems to generate structured, long-form audio narratives. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
arXiv Detail & Related papers (2025-08-27T17:55:38Z) - MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z) - AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
We propose a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models (a hedged fusion-block sketch appears after this list).
arXiv Detail & Related papers (2024-12-19T18:57:21Z) - Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [19.694770666874827]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts (a minimal RMS-envelope sketch appears after this list).
arXiv Detail & Related papers (2024-08-21T18:06:15Z) - FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z) - STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
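Two of the entries above describe mechanisms concretely enough to sketch. First, a hedged sketch of a bidirectional fusion block in the spirit of AV-Link's Fusion Block: each stream cross-attends to the other's diffusion features. The class name, shared feature dimension, and use of plain multi-head cross-attention are assumptions, not the paper's verified design.

```python
# Hedged sketch of a bidirectional audio-video fusion block; names,
# dimensions, and the plain cross-attention layers are assumptions.
import torch
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor):
        # audio_feats: (B, T_audio, dim); video_feats: (B, T_video, dim)
        a_out, _ = self.v2a(self.norm_a(audio_feats), video_feats, video_feats)
        v_out, _ = self.a2v(self.norm_v(video_feats), audio_feats, audio_feats)
        # residual updates so each stream keeps its own denoising trajectory
        return audio_feats + a_out, video_feats + v_out
```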
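Second, the RMS condition from the Video-Foley entry: a frame-wise loudness envelope is simple to compute from a reference waveform. The `frame_len` and `hop` defaults below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of a frame-wise RMS (root-mean-square) loudness envelope of
# the kind Video-Foley uses as a temporal condition; frame_len and hop are
# illustrative assumptions.
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise RMS of a mono waveform (assumes len(audio) >= frame_len)."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, frame_len)[::hop]
    return np.sqrt((frames ** 2).mean(axis=1))
```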