Fugu-MT Paper Translation (Abstract): Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Paper overview: Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

arxiv url: http://arxiv.org/abs/2506.15980v1
Date: Thu, 19 Jun 2025 02:56:06 GMT
Status: Translation complete
System update date: 2025-06-23 19:00:04.919845
Title: Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization
Title (reference translation): Advanced Generation of Sign Language Video via Compressed and Quantized Multi-Condition Tokenization
Authors: Cong Wang, Zexuan Deng, Zhiwei Jiang, Fei Shen, Yafeng Yin, Shiwei Gan, Zifeng Cheng, Shiping Ge, Qing Gu
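Abstract summary: SignViP is a novel framework that incorporates multiple fine-grained conditions. SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity.
Reference score (independently calculated attention score): 13.619845845897947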
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
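Abstract (reference translation): Sign Language Video Generation (SLVG) aims to generate sign language videos from spoken language text. Existing methods mainly rely on a single coarse condition (e.g., skeleton sequences) as the intermediary that bridges the translation model and the video generation model, which limits both the naturalness and the expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (fine-grained poses and 3D hands). SignViP contains three core components. (1) The Sign Video Diffusion Model is trained jointly with a multi-condition encoder to learn continuous embeddings that capture fine-grained motion and appearance. (2) The Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens, giving a compact representation of the conditions. (3) The Multi-Condition Token Translator is trained to translate spoken language text into discrete multi-condition tokens. During inference, the Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded into continuous embeddings by the FSQ Autoencoder and injected into the Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.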
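
Note: the quantization step in component (2) can be illustrated with a minimal NumPy sketch of finite scalar quantization. This is not the SignViP implementation. The function names, the choice of odd level counts, and the example levels are assumptions made for illustration, and the straight-through gradient estimator used when training an FSQ model is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each channel of z to a fixed number of integer levels.

    z: array of shape (..., d). levels: d odd integers (assumed odd here
    so the grid is symmetric around zero). Returns codes in [0, L_i - 1].
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2          # e.g. L = 5 gives grid values -2..2
    bounded = np.tanh(z) * half      # squash each channel into [-half, half]
    rounded = np.round(bounded)      # snap to the nearest grid point
    return (rounded + half).astype(np.int64)

def _basis(levels):
    # Mixed-radix place values: 1, L_0, L_0 * L_1, ...
    levels = np.asarray(levels)
    return np.concatenate(([1], np.cumprod(levels[:-1])))

def codes_to_token(codes, levels):
    """Pack per-channel codes into one discrete token id."""
    return (np.asarray(codes) * _basis(levels)).sum(axis=-1)

def token_to_codes(token, levels):
    """Unpack a token id back into per-channel codes."""
    levels = np.asarray(levels)
    return (np.asarray(token)[..., None] // _basis(levels)) % levels

def codes_to_embedding(codes, levels):
    """Map codes back to continuous values in [-1, 1] for a decoder."""
    half = (np.asarray(levels) - 1) / 2
    return (np.asarray(codes) - half) / half

# Hypothetical example: a 3-channel embedding with 5, 5, and 3 levels.
LEVELS = [5, 5, 3]
z = np.array([[0.0, 10.0, -10.0]])
codes = fsq_quantize(z, LEVELS)
token = codes_to_token(codes, LEVELS)
```

With these levels the codebook has 5 * 5 * 3 = 75 entries, and it is implicit: no embedding table is learned, since every token id corresponds to one point on the fixed grid.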

Related papers