Fugu-MT Paper Translation (Abstract): GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Paper overview: GEditBench v2: A Human-Aligned Benchmark for General Image Editing

arxiv url: http://arxiv.org/abs/2603.28547v1
Date: Mon, 30 Mar 2026 15:08:32 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.465881
Title: GEditBench v2: A Human-Aligned Benchmark for General Image Editing
Authors: Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen,
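Abstract summary: GEditBench v2 is a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks. The paper also proposes PVC-Judge, an open-source pairwise assessment model for evaluating visual consistency. PVC-Judge achieves state-of-the-art evaluation performance among open-source models and surpasses GPT-5.1 on average.
Reference score (attention score computed independently by the site): 58.86807672117726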
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. Besides, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models, and providing a reliable foundation for advancing precise image editing.

Related papers