Fugu-MT Paper Translation (Abstract): Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

Paper overview: Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

arxiv url: http://arxiv.org/abs/2606.19073v1
Date: Wed, 17 Jun 2026 13:44:26 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.184479
Title: Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework
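Title (reference translation): Taming I2V Models for Image HOI Editing: A Cognitive Benchmark and an Agentic Self-Correcting Framework
Authors: Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu
Abstract summary: Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). The paper introduces HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana on interaction.
Reference score (independently computed attention score): 62.705866028818576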
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by letting VLM Q&A after thinking with images containing grounded Human-Object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.
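Abstract (reference translation): Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this challenge unaddressed because they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess the validity of dynamic interactions and the preservation of entangled human-object pairs. We therefore first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, that reliably evaluates instance-level interaction by having a VLM answer questions after reasoning over images containing grounded human-object pairs. Considering that the essence of the task is remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models and find them inherently suited to dynamic editing thanks to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to present the target HOI more accurately. Frames extracted from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.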
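
The abstract describes SCPE only at a high level: an I2V model generates a video, failures are diagnosed, the prompt is refined, and a frame is extracted as the edit. The Python sketch below illustrates a generic generate-evaluate-refine loop of that shape. It is not the authors' implementation. Every name in it (`self_correcting_edit`, `generate_video`, `evaluate_frame`, `refine_prompt`, `Verdict`, `max_rounds`) is hypothetical, and the three callables are stand-ins for an I2V model, a VLM-based checker, and a prompt rewriter that the caller would have to supply.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Frame = str  # placeholder type (assumption); a real system would use image arrays


@dataclass
class Verdict:
    ok: bool       # does the frame show the target interaction?
    feedback: str  # diagnosis of the failure, used to refine the prompt


def self_correcting_edit(
    image: Frame,
    target_hoi: str,
    generate_video: Callable[[Frame, str], List[Frame]],
    evaluate_frame: Callable[[Frame, str], Verdict],
    refine_prompt: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[Optional[Frame], List[str]]:
    """Hypothetical generate-evaluate-refine loop, not the paper's code.

    Returns the first frame judged to show the target interaction (or None
    if no round succeeds) together with the prompts that were tried.
    """
    prompt = target_hoi
    history: List[str] = []
    for _ in range(max_rounds):
        history.append(prompt)
        frames = generate_video(image, prompt)
        verdict: Optional[Verdict] = None
        for frame in frames:
            verdict = evaluate_frame(frame, target_hoi)
            if verdict.ok:
                return frame, history
        if verdict is None:
            break  # the generator produced no frames; nothing to diagnose
        # Refine using the diagnosis from the last frame that was checked.
        prompt = refine_prompt(prompt, verdict.feedback)
    return None, history
```

In this sketch the loop stops at the first accepted frame and caps the number of rounds with `max_rounds`. Both choices are illustrative assumptions and are not details taken from the paper.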

Related papers list