Fugu-MT Paper Translation (Abstract): Group Editing: Edit Multiple Images in One Go

Paper summary: Group Editing: Edit Multiple Images in One Go

arxiv url: http://arxiv.org/abs/2603.22883v3
Date: Thu, 26 Mar 2026 10:38:34 GMT
Status: Translation complete
Last updated in system: 2026-03-27 13:32:29.884785
Title: Group Editing: Edit Multiple Images in One Go
Title (reference translation): Group Editing: Editing multiple images in a single pass
Authors: Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen
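Abstract summary: GroupEditing is a framework that builds explicit and implicit relationships among the images in a group. GroupEditData is a dataset containing high-quality masks and detailed captions for numerous image groups. The paper also proposes GroupEditBench, a benchmark for evaluating the effectiveness of group-level image editing.
Reference score (independently computed attention score): 48.78947366708772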
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model's ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.
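Abstract (reference translation): This paper addresses the task of applying consistent, unified modifications across a set of related images. The task is especially difficult because the images can differ substantially in pose, viewpoint, and spatial layout. Coherent edits require reliable correspondences across the images so that modifications land accurately on semantically aligned regions. To this end, the authors propose GroupEditing, a new framework that builds both explicit and implicit relationships among the images in a group. On the explicit side, geometric correspondences are extracted with VGGT, which provides spatial alignment based on visual features. On the implicit side, the image group is recast as a pseudo-video, and the temporal coherence priors learned by pre-trained video models are used to capture latent relationships. To fuse the two kinds of correspondence effectively, the explicit geometric cues from VGGT are injected into the video model through a new fusion mechanism. To support large-scale training, the authors build GroupEditData, a new dataset with high-quality masks and detailed captions for numerous image groups. To preserve identity during editing, they introduce an alignment-enhanced RoPE module that improves the model's ability to keep appearance consistent across multiple images. Finally, they present GroupEditBench, a dedicated benchmark for evaluating group-level image editing. Extensive experiments show that GroupEditing significantly outperforms existing methods in visual quality, cross-view consistency, and semantic alignment.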
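
The abstract's "pseudo-video" idea can be illustrated with a minimal sketch: the images of a group are treated as the frames of one clip, so a model that expects a (frames, height, width, channels) input sees the whole group at once. The abstract does not describe the authors' actual preprocessing, so everything below is an assumption made for illustration: the names `to_pseudo_video`, `image_shape`, and `blank_image` are hypothetical, images are plain nested lists rather than tensors, and the group is assumed to have already been resized to a common resolution.

```python
from typing import List, Tuple

# Illustrative stand-in for an H x W x C image tensor (assumption, not the paper's format).
Image = List[List[List[float]]]


def image_shape(image: Image) -> Tuple[int, int, int]:
    """Return (height, width, channels) of a nested-list image."""
    return len(image), len(image[0]), len(image[0][0])


def to_pseudo_video(images: List[Image]) -> List[Image]:
    """Arrange an image group as frames of one clip, shape (T, H, W, C).

    Assumption: every image already has the same resolution. The abstract
    does not say how the group is ordered or resized.
    """
    if not images:
        raise ValueError("image group is empty")
    expected = image_shape(images[0])
    for index, image in enumerate(images):
        if image_shape(image) != expected:
            raise ValueError(
                f"image {index} has shape {image_shape(image)}, expected {expected}"
            )
    # The frame axis is simply the position of each image in the group.
    return list(images)


def blank_image(height: int, width: int, value: float) -> Image:
    """Build a constant 3-channel image, used only to exercise the sketch."""
    return [[[value] * 3 for _ in range(width)] for _ in range(height)]


group = [blank_image(4, 6, v) for v in (0.0, 0.5, 1.0)]
video = to_pseudo_video(group)
video_shape = (len(video),) + image_shape(video[0])
print(video_shape)  # (3, 4, 6, 3)
```

The sketch covers only the input arrangement. The VGGT correspondences, the fusion mechanism, and the alignment-enhanced RoPE module named in the abstract are not specified there and are not modeled here.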

Related papers