Fugu-MT Paper Translation (Summary): Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Paper summary: Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

arxiv url: http://arxiv.org/abs/2603.06140v1
Date: Fri, 06 Mar 2026 10:48:27 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.518744
Title: Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion
Title (reference translation): Place-it-R1: The Environment-Aware Reasoning Potential of MLLMs for Video Object Insertion
Authors: Bohai Gu, Taiyi Wu, Dazhao Du, Jian Liu, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo
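Abstract summary: Place-it-R1 is an end-to-end framework for video object insertion. It orchestrates video diffusion following a Think-then-Place paradigm. The MLLM performs physical scene understanding and interaction reasoning. It generates environment-aware chain-of-thought tokens and infers valid insertion regions.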
Reference score (independently computed attention score): 28.621908346945762
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations. First, the MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Second, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. Third, during inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate that Place-it-R1 achieves more physically coherent video object insertion than state-of-the-art solutions and commercial models.
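Abstract (reference translation): Modern video editing techniques achieve high visual fidelity when inserting video objects. However, they concentrate on optimizing visual fidelity rather than physical causality, which leads to edits that are physically inconsistent with the environment. In this work, we propose Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning ability of multimodal large language models (MLLMs). Our framework uses the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion according to a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, the MLLM first performs physical scene understanding and interaction reasoning, generates environment-aware chain-of-thought tokens, infers valid insertion regions, and explicitly guides the diffusion toward physically plausible insertion. Next, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), in which diffusion outputs are fed back to the MLLM for scoring to achieve visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed loop that gradually improves editing quality. In addition, we provide two user-selectable modes: a plausibility-oriented flexible mode that allows environment modifications (for example, generating support structures), and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, giving users explicit control over the plausibility-fidelity trade-off. In extensive experiments, Place-it-R1 achieves physically coherent video object insertion compared with state-of-the-art solutions and commercial models.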
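
The closed-loop refinement described in the abstract (the MLLM scores a diffusion output and triggers another cycle when needed) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `refine_loop`, `toy_generate`, and `toy_critique`, the score range, the threshold, and the round limit are all hypothetical, and the "models" are plain string-based stand-ins so the control flow can run without any real MLLM or diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    video: str      # stand-in for an edited video (hypothetical)
    score: float    # hypothetical plausibility score in [0, 1]
    feedback: str   # hypothetical textual critique


def refine_loop(
    generate: Callable[[str], str],
    critique: Callable[[str], Tuple[float, str]],
    plan: str,
    threshold: float = 0.8,   # assumed acceptance threshold
    max_rounds: int = 3,      # assumed cap on refinement cycles
) -> List[Attempt]:
    """Generate, score, and regenerate until the score clears the threshold."""
    history: List[Attempt] = []
    for _ in range(max_rounds):
        video = generate(plan)
        score, feedback = critique(video)
        history.append(Attempt(video, score, feedback))
        if score >= threshold:
            break
        # Fold the critique into the guidance used for the next round.
        plan = f"{plan} | fix: {feedback}"
    return history


# Toy stand-ins so the sketch runs without any model.
def toy_generate(plan: str) -> str:
    return f"video<{plan}>"


def toy_critique(video: str) -> Tuple[float, str]:
    # The toy score rises with each fix that was folded into the plan.
    fixes = video.count("fix:")
    return min(1.0, 0.5 + 0.2 * fixes), "object floats above the table"


history = refine_loop(toy_generate, toy_critique, "place cup on table")
```

In this toy run the loop stops as soon as an attempt reaches the assumed threshold; in a real system the two callables would wrap a diffusion model and an MLLM scorer.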

Related papers