Fugu-MT paper translation (abstract): Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

Paper summary: Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

arxiv url: http://arxiv.org/abs/2511.14100v1
Date: Tue, 18 Nov 2025 03:37:19 GMT
Status: translation complete
Last updated in system: 2025-11-19 16:23:52.911889
Title: Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
Authors: Yiqing Shen, Chenjia Li, Mathias Unberath
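Abstract summary: Video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
Reference score (independently computed attention score): 8.479321655643195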
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-driven video editing enables users to modify video content using only text queries. Existing methods can modify video content when given explicit descriptions of editing targets with precise spatial locations and temporal boundaries, but these requirements become impractical when users express edits through implicit queries that reference semantic properties or object relationships. We introduce reasoning video editing, a task in which video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, along with RIVER (Reasoning-based Implicit Video Editor), a first model that attempts to solve this complex task. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model processes this representation jointly with the implicit query, performs multi-hop reasoning to determine the modifications, and outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER is trained with reinforcement learning, using rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark built specifically for reasoning video editing, with 100 videos and 519 implicit queries spanning three levels and categories of reasoning complexity. RIVER achieves the best performance on the proposed RVEBenchmark and state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.
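To make the pipeline in the abstract more concrete, here is a minimal Python sketch of the general idea: a digital twin that stores semantic attributes and per-frame boxes, a two-hop query resolution that emits a structured edit instruction, and a reward that mixes a reasoning term with a generation term. Everything here is an illustrative assumption and not the authors' implementation: the names `TwinObject`, `EditInstruction`, `resolve_left_of`, `temporal_iou`, and `combined_reward` are hypothetical, a rule-based function stands in for the large language model, and the reward weighting is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass
class TwinObject:
    """One object in a hypothetical digital twin of a video."""
    object_id: str
    attributes: dict[str, str]   # semantic attributes
    trajectory: dict[int, Box]   # frame index -> bounding box


@dataclass
class EditInstruction:
    """Hypothetical structured instruction for a downstream editor."""
    object_id: str
    start_frame: int
    end_frame: int
    operation: str


def center_x(box: Box) -> float:
    return (box[0] + box[2]) / 2


def resolve_left_of(twin, anchor_attrs, target_attrs, operation) -> Optional[EditInstruction]:
    """Rule-based stand-in for multi-hop reasoning over the twin."""
    # Hop 1: find the anchor object named indirectly by its attributes.
    anchor = next((o for o in twin if anchor_attrs.items() <= o.attributes.items()), None)
    if anchor is None:
        return None
    # Hop 2: find a target that is left of the anchor, and the frames where that holds.
    for obj in twin:
        if obj is anchor or not target_attrs.items() <= obj.attributes.items():
            continue
        frames = sorted(
            f for f in obj.trajectory
            if f in anchor.trajectory
            and center_x(obj.trajectory[f]) < center_x(anchor.trajectory[f])
        )
        if frames:
            return EditInstruction(obj.object_id, frames[0], frames[-1], operation)
    return None


def temporal_iou(pred: EditInstruction, truth: EditInstruction) -> float:
    """Overlap of two inclusive frame spans divided by the span covering both."""
    inter = min(pred.end_frame, truth.end_frame) - max(pred.start_frame, truth.start_frame) + 1
    union = max(pred.end_frame, truth.end_frame) - min(pred.start_frame, truth.start_frame) + 1
    return max(inter, 0) / union


def combined_reward(pred, truth, generation_score: float, w_reason: float = 0.5) -> float:
    """Assumed reward: weighted sum of a reasoning term and a generation-quality term."""
    reasoning = temporal_iou(pred, truth) if pred.object_id == truth.object_id else 0.0
    return w_reason * reasoning + (1 - w_reason) * generation_score


# Toy query: "remove the animal to the left of the person in the red shirt".
twin = [
    TwinObject("cat", {"kind": "animal"}, {f: (80, 20, 100, 40) for f in range(3)}),
    TwinObject("person", {"category": "person", "shirt": "red"},
               {f: (50, 0, 70, 40) for f in range(3)}),
    TwinObject("dog", {"kind": "animal"},
               {0: (10, 20, 30, 40), 1: (20, 20, 40, 40), 2: (70, 20, 90, 40)}),
]
plan = resolve_left_of(twin, {"category": "person", "shirt": "red"}, {"kind": "animal"}, "remove")
truth = EditInstruction("dog", 0, 2, "remove")
reward = combined_reward(plan, truth, generation_score=0.9)
```

The point of the sketch is the separation of concerns: target selection operates only on the structured twin, so it can be scored on its own (here with an object match and a frame-span overlap) before any pixels are generated. The `generation_score` argument is a placeholder for whatever quality measure an actual system would compute on the edited video.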

Related papers