Fugu-MT Paper Translation (Abstract): Hierarchical Codec Diffusion for Video-to-Speech Generation

Paper overview: Hierarchical Codec Diffusion for Video-to-Speech Generation

arxiv url: http://arxiv.org/abs/2604.15923v1
Date: Fri, 17 Apr 2026 10:28:21 GMT
Status: Translation complete
Last updated in system: 2026-04-20 22:00:19.876654
Title: Hierarchical Codec Diffusion for Video-to-Speech Generation
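Title (reference translation): Hierarchical Codec Diffusion for Speech Synthesis
Authors: Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li, Boyuan Cao, Hongming Shan
Abstract summary: Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. Existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. We propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment.
Reference score (independently computed attention level): 34.08427878034203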
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at https://github.com/Jiaxin-Ye/HiCoDiT.
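The abstract relies on the coarse-to-fine structure of RVQ tokens. The following is a minimal sketch of generic residual vector quantization, not the codec used by HiCoDiT: the function names, the random toy codebooks, the halving scale per level, and the zero vector added to each codebook are all assumptions made for illustration. It shows only that the first level gives the coarsest approximation and each later level codes what the earlier ones left over.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level codes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks, num_levels=None):
    """Sum the selected entries of the first num_levels codebooks."""
    levels = len(tokens) if num_levels is None else num_levels
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in list(zip(tokens, codebooks))[:levels]:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_levels=4, size=16, dim=8, seed=0):
    """Toy codebooks (hypothetical): random entries whose scale halves per level.

    Entry 0 of every codebook is the zero vector, so adding a level can
    never increase the reconstruction error in this toy setup.
    """
    rng = random.Random(seed)
    codebooks = []
    for level in range(num_levels):
        entries = [[0.0] * dim]
        for _ in range(size - 1):
            entries.append([rng.gauss(0.0, 1.0) * 0.5 ** level for _ in range(dim)])
        codebooks.append(entries)
    return codebooks
```

Decoding with `num_levels=1` uses only the coarsest token, and raising `num_levels` adds finer residual detail. A real codec learns its codebooks from data rather than sampling them at random.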
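Abstract (reference translation): Video-to-Speech (VTS) generation aims to synthesize speech from silent video, with no auditory signal available. Existing VTS methods, however, ignore the hierarchical nature of speech, which ranges from coarse speaker-aware semantics to fine-grained prosodic detail. This omission prevents visual and speech features from being aligned directly at specific hierarchical levels during property matching. Building on the hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, this paper proposes HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Because lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT uses low-level and high-level blocks to generate tokens at the different levels. The low-level blocks are conditioned on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to make coarse-to-fine conditioning more effective, the paper proposes a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments show that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are available at https://github.com/Jiaxin-Ye/HiCoDiT.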
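The abstract names channel-wise and temporal-wise normalization but gives no formula. The sketch below is one possible reading, not the authors' formulation: it assumes a feature map of shape (frames, channels), and the function names, the choice of normalization axes, and the equal-weight combination of the two branches are all assumptions. In the paper the scale and shift values would presumably be predicted from visual conditions; here they are passed in directly.

```python
import math


def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def channel_wise_norm(x, gamma, beta):
    """Normalize each frame over its channels, then apply one global
    per-channel scale and shift (gamma and beta have length C)."""
    out = []
    for frame in x:
        normed = _normalize(frame)
        out.append([g * v + b for g, v, b in zip(gamma, normed, beta)])
    return out


def temporal_wise_norm(x, gamma_t, beta_t):
    """Normalize each channel over time, then apply a per-frame
    scale and shift (gamma_t and beta_t have length T)."""
    num_frames, num_channels = len(x), len(x[0])
    columns = [
        _normalize([x[t][c] for t in range(num_frames)])
        for c in range(num_channels)
    ]
    return [
        [gamma_t[t] * columns[c][t] + beta_t[t] for c in range(num_channels)]
        for t in range(num_frames)
    ]


def dual_scale_norm(x, gamma, beta, gamma_t, beta_t):
    """Average the two branches (this combination rule is an assumption)."""
    global_branch = channel_wise_norm(x, gamma, beta)
    local_branch = temporal_wise_norm(x, gamma_t, beta_t)
    return [
        [0.5 * (p + q) for p, q in zip(row_g, row_l)]
        for row_g, row_l in zip(global_branch, local_branch)
    ]
```

In this reading, the per-channel parameters stay fixed across the whole utterance, which suits a global property such as vocal style, while the per-frame parameters vary over time, which suits local prosody.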

Related papers