DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
- URL: http://arxiv.org/abs/2512.24408v1
- Date: Tue, 30 Dec 2025 18:43:38 GMT
- Title: DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
- Authors: Bohong Chen, Haiyang Liu,
- Abstract summary: DyStream is a flow matching-based autoregressive model that generates video in real time from both speaker and listener audio. It generates each frame within 34 ms, keeping the entire system latency under 100 ms. It achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF.
- Score: 7.852008880859938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic dyadic talking-head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays; this high latency prevents the immediate, non-verbal feedback required of a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that generates video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module that incorporates short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple and effective method significantly surpasses alternative causal strategies, including distillation and a generative encoder. Extensive experiments show that DyStream generates video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. It also achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.
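The latency claim in the abstract can be checked with simple arithmetic: a frame cannot be produced until the 60 ms lookahead window has filled, after which generation takes 34 ms. The sketch below is illustrative only; the variable names and the budget-checking logic are assumptions, not the authors' code, and only the quoted figures (60 ms, 34 ms, 100 ms) come from the abstract.

```python
LOOKAHEAD_MS = 60      # short future context consumed by the causal encoder
GEN_MS_PER_FRAME = 34  # reported per-frame generation time
BUDGET_MS = 100        # target end-to-end system latency

def system_latency_ms(lookahead_ms: float, gen_ms: float) -> float:
    """Worst-case latency for one frame: we must wait for the lookahead
    window to fill before the frame can be generated."""
    return lookahead_ms + gen_ms

latency = system_latency_ms(LOOKAHEAD_MS, GEN_MS_PER_FRAME)
print(latency)              # 94
print(latency < BUDGET_MS)  # True
```

This shows why the lookahead must stay short: any future context longer than 100 - 34 = 66 ms would blow the stated budget.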
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z) - DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression [38.495966630021556]
We present DiffVC-RT, the first framework designed to achieve real-time diffusion-based Neural Video Compression (NVC). We show that DiffVC-RT achieves 80.1% perceptual savings in terms of LPIPS over VTM-17.0 on the HEVC dataset, with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU.
arXiv Detail & Related papers (2026-01-28T12:59:25Z) - Real-Time Streamable Generative Speech Restoration with Flow Matching [35.33575179870606]
Stream.FM is a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms. We show that high-quality streaming generative speech processing can be realized on consumer GPUs available today.
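Several entries in this list (Stream.FM, BinauralFlow, DyStream itself) rely on flow matching, where a learned velocity field is integrated from noise at t=0 to data at t=1. The toy sketch below illustrates only the sampling mechanics: the closed-form velocity toward a fixed scalar target is a stand-in assumption for the neural network these papers actually train.

```python
def euler_sample(x0, velocity, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    with the forward Euler method."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Under straight (optimal-transport) probability paths, the target velocity
# from the current state x toward a data point x1 is (x1 - x) / (1 - t).
x1 = 3.0  # toy "data" point
v = lambda x, t: (x1 - x) / (1.0 - t)

print(euler_sample(x0=0.0, velocity=v))  # ≈ 3.0
```

Because straight paths make the true velocity constant along each trajectory, even coarse Euler integration lands on the target, which is one reason flow matching is attractive for low-step, low-latency streaming generation.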
arXiv Detail & Related papers (2025-12-22T14:41:17Z) - StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z) - MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, which enables sub-second latency with streaming generation at up to 29 FPS on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance but does not perform inference on the fly. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z) - Diffusion Buffer for Online Generative Speech Enhancement [32.98694610706198]
Diffusion Buffer is a generative diffusion-based speech enhancement model. It requires only one neural network call per incoming signal frame from a stream of data, and performs enhancement in an online fashion on a consumer-grade GPU.
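The one-call-per-frame constraint above can be met by keeping frames at staggered noise levels in a fixed-length buffer: each incoming frame triggers a single call that advances every buffered frame by one denoising step, and the oldest frame exits fully denoised. The sketch below is a minimal illustration of that scheduling idea under stated assumptions; `denoise_all` is a trivial stand-in for the paper's neural network, and the class and method names are hypothetical.

```python
def denoise_all(frames, levels):
    """Stand-in for ONE network call that jointly processes the whole
    buffer; each frame advances one noise level per call (assumption)."""
    return frames, [max(0, lvl - 1) for lvl in levels]

class DiffusionBuffer:
    def __init__(self, depth=4):
        self.depth = depth            # noise levels = frames of latency
        self.frames, self.levels = [], []

    def push(self, frame):
        """Ingest one noisy frame, run one network call, and emit the
        oldest frame once it is fully denoised (level 0); else None."""
        self.frames.append(frame)
        self.levels.append(self.depth)  # newest frame is noisiest
        self.frames, self.levels = denoise_all(self.frames, self.levels)
        if self.levels and self.levels[0] == 0:
            self.levels.pop(0)
            return self.frames.pop(0)
        return None
```

Note the trade-off this exposes: the buffer depth sets both the number of denoising steps each frame receives and the algorithmic latency in frames.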
arXiv Detail & Related papers (2025-10-21T15:52:33Z) - SoundReactor: Frame-level Online Video-to-Audio Generation [39.113214321291586]
Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. We introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. SoundReactor is the first simple yet effective framework explicitly tailored for this task.
arXiv Detail & Related papers (2025-10-02T15:18:00Z) - READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for speech-driven generation. We show that READ outperforms state-of-the-art methods, generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z) - BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models [62.38713281234756]
A binaural rendering pipeline aims to synthesize audio that mimics natural hearing from a mono audio source. Many methods have been proposed to solve this problem, but they struggle with rendering quality and streamable inference. We propose BinauralFlow, a flow matching-based streaming binaural speech synthesis framework.
arXiv Detail & Related papers (2025-05-28T20:59:15Z) - StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into a batching denoising process.
arXiv Detail & Related papers (2023-12-19T18:18:33Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming models simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.