Fugu-MT Paper Translation (Abstract): Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection

Paper summary: Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection

arxiv url: http://arxiv.org/abs/2509.25502v1
Date: Mon, 29 Sep 2025 20:59:19 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:44:59.942181
Title: Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection
Title (reference translation): Seeing before reasoning: a unified framework for generalizable and explainable fake image detection
Authors: Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Junyan Ye, Ke-Yue Zhang, Yue Zhou, Peng Jin, Bin Li, Taiping Yao, Shouhong Ding
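Abstract summary: We argue that the root of this failure lies in a fundamental mismatch. We propose Forensic-Chat, a generalizable, explainable, and conversational assistant for fake image detection.
Reference score (attention level, computed independently by this site): 58.82268659497348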
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Detecting AI-generated images with multimodal large language models (MLLMs) has gained increasing attention, due to their rich world knowledge, common-sense reasoning, and potential for explainability. However, naively applying those MLLMs for detection often leads to suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. First, they do not really see: existing MLLMs' vision encoders are primarily optimized for semantic-oriented recognition rather than the perception of low-level signals, leaving them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model grounds its judgment on incomplete and limited visual observations. Second, existing finetuning data for detection typically uses narrow, instruction-style formats, which diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model therefore exploits these linguistic shortcuts, resulting in catastrophic forgetting of pretrained knowledge (even the basic dialogue capabilities). In response, we advocate for a new paradigm: seeing before reasoning. We propose that MLLMs should first be trained to perceive artifacts, strengthening their artifact-aware visual perception, so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and still-conversational (for multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark tailored for the evaluation of the MLLM's explainability for image forensics from five key aspects. Extensive experiments show its superior generalization and genuinely reliable explainability.
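Abstract (reference translation): Detecting AI-generated images with multimodal large language models (MLLMs) has been attracting attention. However, naively applying these MLLMs to detection often yields suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch. First, the vision encoders of existing MLLMs are optimized primarily for semantic-oriented recognition rather than the perception of low-level signals, which leaves them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model bases its judgment on incomplete and limited visual observations. Second, existing finetuning data for detection typically uses narrow, instruction-style formats that diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model exploits these linguistic shortcuts and catastrophically forgets pretrained knowledge (even basic dialogue capabilities). In response, we advocate a new paradigm. MLLMs should first be trained to perceive artifacts, strengthening their artifact-aware visual perception so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and conversational (multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark for evaluating the explainability of MLLMs in image forensics from five key aspects. Extensive experiments demonstrate its superior generalization and genuinely reliable explainability.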
