Fugu-MT paper translation (abstract): CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Paper overview: CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

arxiv url: http://arxiv.org/abs/2603.29664v1
Date: Tue, 31 Mar 2026 12:25:53 GMT
Status: translation complete
Last updated in system: 2026-04-01 15:25:03.641345
Title: CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
Title (reference translation): CutClaw: Agent-driven hours-long video editing via music synchronization
Authors: Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
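Abstract summary: CutClaw is an autonomous multi-agent framework designed to edit hours of raw footage into meaningful short videos. It produces videos that are synchronized to music, follow the given instructions, and are visually appealing. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos.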
Reference score (independently computed attention score): 96.62825277039117
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Editing video content in alignment with audio has become a form of human-made digital art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) as an agent system to edit hours-long raw footage into meaningful short videos. It produces videos that are synchronized to music, follow the given instructions, and are visually appealing. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
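Abstract (reference translation): Editing video content in alignment with audio forms a human-made digital art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that uses multiple Multimodal Language Models (MLLMs) as an agent system to edit hours of raw footage into meaningful short videos. It produces videos that are synchronized to music, follow the given instructions, and are visually appealing. In more detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow, structures the long-term narrative, and anchors visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.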
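
Illustrative sketch (not from the paper): the abstract mentions anchoring visual scenes to musical shifts but gives no algorithmic detail. One very simple way to picture rhythm alignment is to move each proposed cut time to the nearest detected beat. The code below is a toy example under that assumption only; it is not the CutClaw implementation, and the function name `snap_to_beats` and its inputs (cut and beat times in seconds) are hypothetical.

```python
from bisect import bisect_left


def snap_to_beats(cut_times, beat_times):
    """Move each proposed cut time (seconds) to the nearest beat time.

    Hypothetical helper for illustration; ties go to the earlier beat.
    """
    if not beat_times:
        raise ValueError("beat_times must not be empty")
    beats = sorted(beat_times)
    snapped = []
    for t in cut_times:
        # Index of the first beat that is >= t.
        i = bisect_left(beats, t)
        # Only the beat just before and the beat at/after t can be nearest.
        candidates = beats[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped


if __name__ == "__main__":
    beats = [0.0, 0.5, 1.0, 1.5, 2.0]   # assumed beat grid (120 BPM)
    cuts = [0.1, 0.7, 1.26, 3.0]        # assumed rough cut proposals
    print(snap_to_beats(cuts, beats))   # [0.0, 0.5, 1.5, 2.0]
```

A real system would presumably also weigh visual and narrative criteria when choosing cuts, as the abstract describes for the Editor and Reviewer Agents; this sketch covers only the timing step.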


Related papers