Fugu-MT Paper Translation (Abstract): A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Paper summary: A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

arxiv url: http://arxiv.org/abs/2606.19259v1
Date: Wed, 17 Jun 2026 16:37:16 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.273894
Title: A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
Authors: Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang,
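Abstract summary: We introduce a benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness.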
Reference score (independently computed attention score): 5.27107161551086
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.


Related papers