Fugu-MT Paper Translation (Abstract): GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

Paper overview: GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

arxiv url: http://arxiv.org/abs/2605.30045v1
Date: Thu, 28 May 2026 14:58:36 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.410032
Title: GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver
Authors: Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian
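Abstract summary: GenEraser is a novel framework for generalized, high-fidelity video object and effect removal. To fully exploit the multimodal priors of Diffusion Transformers, it introduces a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance. It also proposes a Learnable Deep "CFG" Fusion mechanism (LD-CFG) to adaptively balance the relative dominance of mask and text conditions.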
Reference score (attention score computed independently by this site): 107.6554560318856
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. https://cyqii.github.io/GenEraser.github.io/


Related papers