Fugu-MT Paper Translation (Summary): TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Paper summary: TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

arxiv url: http://arxiv.org/abs/2603.06057v1
Date: Fri, 06 Mar 2026 09:09:01 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.478814
Title: TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
Authors: Soumya Mazumdar, Vineet Kumar Rakesh
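Abstract summary: This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker.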
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff

Related papers