Fugu-MT Paper Translation (Abstract): StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Paper abstract: StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

arxiv url: http://arxiv.org/abs/2403.14773v1
Date: Thu, 21 Mar 2024 18:27:29 GMT
Status: Translation complete
System update date: 2024-03-25 19:26:17.394516
Title: StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Title (reference translation): StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Authors: Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
Abstract summary: This paper introduces StreamingT2V, which autoregressively generates long videos of 80, 240, 600, 1200 or more frames with smooth transitions. Our code will be available at https://github.com/Picsart-AI-Research/StreamingT2V.
Reference score (independently computed attention score): 58.49820807662246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video generation (typically 16 or 24 frames), ending up with hard-cuts when naively extended to the case of long video synthesis. To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that enables to apply a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates high motion amount. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. Thus, we propose with StreamingT2V a high-quality seamless text-to-long video generator that outperforms competitors with consistency and motion. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V
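Abstract (reference translation): Text-to-video diffusion models can generate high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches focus mainly on high-quality short video generation (typically 16 or 24 frames) and produce hard cuts when naively extended to long video synthesis. To overcome this limitation, we introduce StreamingT2V, an autoregressive approach that generates 80, 240, 600, 1200 or more frames with smooth transitions. The key components are: (i) a short-term memory block called the conditional attention module (CAM), which conditions the current generation on features extracted from the previous chunk via an attention mechanism, leading to consistent chunk transitions; (ii) a long-term memory block called the appearance preservation module, which extracts high-level scene and object features from the first video chunk so that the model does not forget the initial scene; and (iii) a randomized blending approach that makes it possible to apply a video enhancer autoregressively to infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates a high amount of motion. In contrast, competing image-to-video methods are prone to video stagnation when applied autoregressively. We therefore propose StreamingT2V, a high-quality text-to-long-video generator that outperforms competitors in consistency and motion. Our code will be available at https://github.com/Picsart-AI-Research/StreamingT2V.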
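
The autoregressive, chunk-by-chunk control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `generate_chunk` is a hypothetical stand-in for a diffusion model call, frames are represented as plain floats, and the chunk sizes (`first_chunk`, `n_new`, `n_cond`) are illustrative assumptions that do not come from the abstract. The sketch only shows how each new chunk might be conditioned on the tail of the previous chunk (short-term memory) and on a fixed feature taken from the first chunk (long-term memory).

```python
import random


def generate_chunk(prompt, cond_frames, anchor, n_new, rng):
    """Hypothetical stand-in for a model call; returns n_new toy frames (floats)."""
    # Continue from the last conditioning frame, or from the anchor if none exist.
    value = cond_frames[-1] if cond_frames else anchor
    frames = []
    for _ in range(n_new):
        # Small random motion plus a weak pull toward the anchor, so the toy
        # "scene" keeps moving without drifting arbitrarily far from the start.
        value = value + rng.uniform(-0.1, 0.1) + 0.05 * (anchor - value)
        frames.append(value)
    return frames


def generate_long_video(prompt, total_frames, first_chunk=16, n_new=8,
                        n_cond=8, seed=0):
    """Toy autoregressive loop: extend the video chunk by chunk."""
    rng = random.Random(seed)
    # The first chunk has no previous frames to condition on.
    video = generate_chunk(prompt, [], 0.0, first_chunk, rng)
    # Long-term memory (assumed form): one fixed feature from the first chunk.
    anchor = video[0]
    while len(video) < total_frames:
        # Short-term memory (assumed form): the last n_cond generated frames.
        cond = video[-n_cond:]
        n = min(n_new, total_frames - len(video))
        video.extend(generate_chunk(prompt, cond, anchor, n, rng))
    return video[:total_frames]


if __name__ == "__main__":
    for length in (80, 240, 600, 1200):
        print(length, len(generate_long_video("a toy prompt", length)))
```

Because every chunk depends only on a bounded window of earlier frames plus one fixed anchor, the loop can in principle be extended to any target length, which is the property the abstract refers to as "extendable".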


Related papers