Fugu-MT Paper Translation (Summary): When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Paper summary: When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

arxiv url: http://arxiv.org/abs/2605.27348v1
Date: Tue, 26 May 2026 17:50:17 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:42.578355
Title: When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
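Title (reference translation): Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
Authors: Kim Jihyeon, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi
Abstract summary: This paper introduces Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals. It is shown to constitute a previously underutilized detection axis orthogonal to existing low-level paradigms. A four-step mechanistic account explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites.
Reference score (attention score computed independently by the site): 8.55568342913716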
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
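The abstract reports balanced accuracy and notes that real- and fake-class recalls rise together, which rules out a "predict-all-fake" artifact. The sketch below illustrates that reasoning on toy labels. The label encoding (0 = real, 1 = fake), the function names, and the toy predictions are assumptions made for illustration; they are not taken from the paper or its code.

```python
def class_recalls(y_true, y_pred):
    """Return (real_recall, fake_recall), assuming labels 0 = real, 1 = fake."""
    recalls = []
    for label in (0, 1):
        # Number of ground-truth samples in this class.
        total = sum(1 for t in y_true if t == label)
        # Number of those samples that the detector labelled correctly.
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total if total else 0.0)
    return tuple(recalls)


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy is the unweighted mean of the per-class recalls."""
    real_recall, fake_recall = class_recalls(y_true, y_pred)
    return (real_recall + fake_recall) / 2


# Toy data (hypothetical): four real and four fake images.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
all_fake = [1] * 8                   # degenerate detector: everything is fake
mixed = [0, 0, 0, 1, 1, 1, 1, 0]     # detector that gets 3 of 4 right per class

print(class_recalls(y_true, all_fake), balanced_accuracy(y_true, all_fake))
print(class_recalls(y_true, mixed), balanced_accuracy(y_true, mixed))
```

A detector that labels everything as fake reaches perfect fake-class recall but zero real-class recall, so its balanced accuracy stays at 0.5. A gain in balanced accuracy accompanied by a rise in both recalls therefore cannot come from that shortcut.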
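Abstract (reference translation): Recent generative models have largely closed the gap on low-level artifacts (pixel fingerprints, frequency anomalies, upsampling traces), particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. This paper introduces Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and shows that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. The insight is instantiated through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, in which strict pair-level grouping, rather than augmentation, forecloses generator-fingerprint memorization as an optimization-time shortcut; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) cross-architecture validation showing that the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and by +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account (paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and a diffusion-family shared spectral weakness in periocular structure) explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. The code will be released upon acceptance to facilitate reproducibility.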
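The abstract says strict pair-level grouping keeps a detector from memorizing generator fingerprints, but it does not say how the grouping is implemented. The following is a minimal sketch of one possible reading: each real image and its edited counterpart share an identifier and are always assigned to the same split. The `pair_id` field, the `split_by_pair` helper, and the 20% test fraction are all hypothetical and are not drawn from the paper.

```python
import random


def split_by_pair(samples, test_fraction=0.2, seed=0):
    """Split samples so that both members of a pair land in the same split.

    `samples` is a list of dicts with a hypothetical "pair_id" key that links
    a real image to its edited counterpart.
    """
    pair_ids = sorted({s["pair_id"] for s in samples})
    rng = random.Random(seed)      # local RNG keeps the split reproducible
    rng.shuffle(pair_ids)
    n_test = max(1, round(len(pair_ids) * test_fraction))
    test_ids = set(pair_ids[:n_test])
    train = [s for s in samples if s["pair_id"] not in test_ids]
    test = [s for s in samples if s["pair_id"] in test_ids]
    return train, test


# Toy dataset (hypothetical): 10 pairs, each with one real and one fake image.
samples = []
for pair_id in range(10):
    samples.append({"pair_id": pair_id, "label": "real"})
    samples.append({"pair_id": pair_id, "label": "fake"})

train, test = split_by_pair(samples)
print(len(train), len(test))
```

Because assignment happens at the level of the pair identifier, no pair is ever divided between training and evaluation, and every split keeps a real image next to its edited twin. The authors' actual grouping may operate differently, for example at the batch level during optimization.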

Related papers