Fugu-MT Paper Translation (Summary): Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Paper summary: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

arxiv url: http://arxiv.org/abs/2502.10248v2
Date: Mon, 17 Feb 2025 08:58:33 GMT
Status: Translation complete
Last updated in system: 2025-02-18 14:02:28.012439
Title: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Title (reference translation): Step-Video-T2V Technical Report: The Practices, Challenges and Future of Video Foundation Model
Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang,
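Abstract summary: We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames long. A Variational Autoencoder, Video-VAE, is designed for video generation tasks and achieves 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, and demonstrates state-of-the-art text-to-video quality.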
Reference score (independently computed attention metric): 133.01510927611452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
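Abstract (reference translation): We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames long. Video-VAE, a deep compression Variational Autoencoder, is designed for video generation tasks; it achieves a 16x16 spatial compression ratio and an 8x temporal compression ratio while maintaining exceptional video reconstruction quality. User prompts are encoded with two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained with Flow Matching and is used to denoise input noise into latent frames. Video-DPO, a video-based DPO approach, reduces artifacts and improves the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, and shows state-of-the-art text-to-video quality compared with both open-source and commercial engines. We further discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. Both Step-Video-T2V and Step-Video-T2V-Eval are available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can also be accessed at https://yuewen.cn/videos. Our goal is to accelerate the innovation of video foundation models and empower video content creators.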
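
To make the compression ratios quoted in the abstract concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It takes only the 16x16 spatial ratio, the 8x temporal ratio, and the 204-frame maximum from the abstract. The function names, the 512x768 resolution, and the round-up behaviour for sizes that do not divide evenly are assumptions made for illustration; latent channel count is ignored.

```python
import math

SPATIAL_RATIO = 16   # per spatial axis, from the abstract (16x16)
TEMPORAL_RATIO = 8   # along the time axis, from the abstract (8x)


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: sizes that are not exact multiples of the ratio are
    rounded up. The abstract does not say how the real Video-VAE
    handles such cases.
    """
    return (
        math.ceil(frames / TEMPORAL_RATIO),
        math.ceil(height / SPATIAL_RATIO),
        math.ceil(width / SPATIAL_RATIO),
    )


def reduction_factor() -> int:
    """Nominal reduction in the number of spatio-temporal positions."""
    return SPATIAL_RATIO * SPATIAL_RATIO * TEMPORAL_RATIO


# Hypothetical clip: 204 frames (the maximum in the abstract) at an
# assumed 512x768 resolution, which the abstract does not specify.
print(latent_shape(204, 512, 768))
print(reduction_factor())
```

Under these assumptions, each latent position stands in for a block of 16 x 16 x 8 = 2048 pixel positions, which is the sense in which the abstract calls Video-VAE a deep compression model.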

Related papers