JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
- URL: http://arxiv.org/abs/2506.23552v1
- Date: Mon, 30 Jun 2025 06:51:40 GMT
- Title: JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
- Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
- Abstract summary: JAM-Flow is a unified framework to simultaneously synthesize and condition on both facial motion and speech. It supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as talking head generation from text and audio-driven animation within a single model.
- Score: 30.02208748898321
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web
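To make the coupling concrete, below is a minimal, illustrative PyTorch sketch of how two temporally aligned token streams could be tied together with a localized joint attention mask, as described in the abstract. The class name, window size, and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalizedJointAttention(nn.Module):
    """Illustrative joint attention over temporally aligned motion and audio tokens.

    Cross-modal attention is restricted to a +/- `window` of frames, while
    within-modality attention is unrestricted (hypothetical configuration).
    """

    def __init__(self, dim: int, heads: int = 8, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, motion: torch.Tensor, audio: torch.Tensor):
        # motion, audio: (batch, T, dim), aligned frame-by-frame on a shared time axis
        T = motion.shape[1]
        x = torch.cat([motion, audio], dim=1)            # joint token sequence of length 2T

        # Build the localized mask: True = attention blocked.
        t = torch.arange(T, device=x.device)
        near = (t[:, None] - t[None, :]).abs() <= self.window
        allow = torch.ones(2 * T, 2 * T, dtype=torch.bool, device=x.device)
        allow[:T, T:] = near                             # motion attends only to nearby audio frames
        allow[T:, :T] = near                             # audio attends only to nearby motion frames
        out, _ = self.attn(x, x, x, attn_mask=~allow)
        return out[:, :T], out[:, T:]                    # split back into per-modality streams

# Toy usage: batch of 2, 16 aligned frames, 256-dim tokens.
motion, audio = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
joint_motion, joint_audio = LocalizedJointAttention(dim=256)(motion, audio)
```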
Related papers
- AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation [24.799628787198397]
AudioGen-Omni generates high-fidelity audio, speech, and song coherently synchronized with the input video.
A joint training paradigm integrates large-scale video-text-audio corpora.
Dense frame-level representations are fused using an AdaLN-based joint attention mechanism (a minimal sketch of such a block follows this entry).
With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
arXiv Detail & Related papers (2025-08-01T16:03:57Z)
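As a rough illustration of the AdaLN-based joint attention mentioned in the AudioGen-Omni summary above, the sketch below modulates normalized fused tokens with a scale, shift, and gate predicted from a conditioning vector. All names, shapes, and the choice of conditioning are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class AdaLNJointAttention(nn.Module):
    """Sketch of AdaLN-modulated joint attention over fused frame-level tokens.

    A conditioning vector (assumed here to be, e.g., pooled video features plus
    a timestep embedding) predicts per-channel scale, shift, and gate values.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(dim, 3 * dim)            # -> scale, shift, gate
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor):
        # tokens: (B, L, dim) fused audio/text/video tokens; cond: (B, dim)
        scale, shift, gate = self.to_mod(cond).chunk(3, dim=-1)
        h = self.norm(tokens) * (1 + scale[:, None]) + shift[:, None]
        out, _ = self.attn(h, h, h)
        return tokens + gate[:, None] * out              # gated residual update

# Toy usage: 64 fused tokens of width 256, conditioned on a 256-dim vector.
y = AdaLNJointAttention(dim=256)(torch.randn(2, 64, 256), torch.randn(2, 256))
```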
- Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model [43.533849239738394]
Stream-Omni is a large language-vision-speech model with efficient modality alignments.
It can simultaneously support interactions under various modality combinations.
It achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks.
arXiv Detail & Related papers (2025-06-16T16:06:45Z)
- Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a look-ahead mechanism, relying on future text to achieve natural streaming speech synthesis.
We propose LE, a streaming framework for generating high-quality speech frame by frame.
Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.
We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.
AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
- OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking [22.337906095079198]
We present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text.
Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
- Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers [58.86974149731874]
Cosh-DiT is a co-speech gesture video synthesis system with hybrid Diffusion Transformers.
We introduce an audio Diffusion Transformer to synthesize expressive gesture dynamics synchronized with speech rhythms.
For realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer.
arXiv Detail & Related papers (2025-03-13T01:36:05Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens (a minimal quantization sketch follows this entry).
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
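To illustrate the finite-scalar quantization idea mentioned in the CosyVoice 2 summary above: each latent channel is bounded and rounded to a small fixed set of levels, with a straight-through estimator for gradients. The level counts below are arbitrary placeholders, not CosyVoice 2's configuration.

```python
import torch

def finite_scalar_quantize(z: torch.Tensor, levels=(8, 5, 5, 5)) -> torch.Tensor:
    """Minimal finite-scalar quantization sketch (placeholder level counts).

    z: (..., D) latent with D == len(levels). Each channel is squashed with
    tanh, scaled to its level range, and rounded; the straight-through trick
    keeps the operation differentiable for training.
    """
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (lv - 1) / 2
    bounded = torch.tanh(z) * half                     # continuous, in [-half, half]
    rounded = torch.round(bounded)                     # nearest discrete level
    return bounded + (rounded - bounded).detach()      # straight-through estimator

# Toy usage: quantize 16 frames of a 4-channel latent.
codes = finite_scalar_quantize(torch.randn(2, 16, 4))
```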
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows [21.677178476653385]
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis.
It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis.
arXiv Detail & Related papers (2024-12-02T06:13:01Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal sequence transcoder to bring text and speech into a joint space (an illustrative contrastive objective is sketched after this entry).
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
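The sketch below shows one way a frame-level cross-modal contrastive objective, in the spirit of the VQ-CTAP summary above, could pull paired text and speech representations into a joint space. It is a generic symmetric InfoNCE with assumed shapes, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 speech_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over paired frame-level embeddings.

    text_emb, speech_emb: (N, D), where row i of each tensor is assumed to be
    an aligned pair; matching rows are positives, all other rows are negatives.
    """
    t = F.normalize(text_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = t @ s.T / temperature                     # (N, N) cosine similarities
    targets = torch.arange(t.shape[0], device=t.device)
    loss_t2s = F.cross_entropy(logits, targets)        # text -> speech direction
    loss_s2t = F.cross_entropy(logits.T, targets)      # speech -> text direction
    return (loss_t2s + loss_s2t) / 2

# Toy usage: 32 paired frames with 256-dim embeddings.
loss = cross_modal_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```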
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Unified speech and gesture synthesis using flow matching [24.2094371314481]
This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures (a minimal flow-matching objective is sketched after this entry).
arXiv Detail & Related papers (2023-10-08T14:37:28Z)
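Both JAM-Flow and the entry above build on flow matching. Below is a minimal sketch of a conditional flow-matching training objective using straight noise-to-data paths; `velocity_net` and all shapes are hypothetical placeholders, not either paper's model or exact recipe.

```python
import torch

def flow_matching_loss(velocity_net, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Minimal conditional flow-matching loss over straight interpolation paths.

    x1: (B, T, D) data (e.g. speech or motion features); cond: (B, C) conditioning.
    The network regresses the constant velocity x1 - x0 at a random time t.
    """
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1)                  # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                         # point on the straight path
    target_velocity = x1 - x0
    pred_velocity = velocity_net(xt, t.view(-1), cond)
    return ((pred_velocity - target_velocity) ** 2).mean()

# Toy usage with a placeholder predictor that just returns zeros.
velocity_net = lambda xt, t, cond: torch.zeros_like(xt)
loss = flow_matching_loss(velocity_net, torch.randn(4, 80, 128), torch.randn(4, 32))
```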