Fugu-MT Paper Translation (Abstract): LongCat-Video-Avatar 1.5 Technical Report

Paper overview: LongCat-Video-Avatar 1.5 Technical Report

arxiv url: http://arxiv.org/abs/2605.26486v1
Date: Tue, 26 May 2026 02:54:12 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.592736
Title: LongCat-Video-Avatar 1.5 Technical Report
Authors: Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang
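Abstract summary: LongCat-Video-Avatar 1.5 is an upgraded open-source framework that prioritizes systematic engineering and production-readiness. v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency.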
Reference score (independently computed attention score): 39.46508887787761
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.

Related papers