Fugu-MT Paper Translation (Abstract): Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

Paper overview: Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

arxiv url: http://arxiv.org/abs/2603.17173v1
Date: Tue, 17 Mar 2026 22:08:37 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.419235
Title: Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience
Title (reference translation): Generalist Multimodal LLMs Acquire Biometric Expertise through Human Salience
Authors: Jacob Piland, Byron Dowling, Christopher Sweet, Adam Czajka
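Abstract summary: General-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human knowledge. Gemini with expert-informed prompts is shown to outperform both a specialized convolutional neural network (CNN)-based baseline and human examiners. The results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.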
Reference score (independently computed attention score): 3.0925941606647123
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to the rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
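Abstract (reference translation): Iris presentation attack detection (PAD) is essential for secure biometric deployments, but collecting data that represents future unknown attacks is impossible, and collecting sufficiently diverse data, which is still limited in its predictive power, is expensive. In addition, sharing biometric data raises privacy concerns. Because new attack vectors requiring adaptable solutions are emerging rapidly, we examine whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through an analysis of vision encoder embeddings applied to our dataset, we show that the pre-trained vision transformers in MLLMs inherently cluster many iris attack types even though they were never explicitly trained for this task. However, where the clustering overlaps between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) allow these models to resolve the ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally hosted models (e.g., Llama 3.2-Vision), we find that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally deployable Llama achieves near-human performance. These results establish that MLLMs deployable within institutional privacy constraints are a viable path for iris PAD.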
