Fugu-MT Paper Translation (Abstract): Just Noticeable Difference for Large Multimodal Models

Paper summary: Just Noticeable Difference for Large Multimodal Models

arxiv url: http://arxiv.org/abs/2507.00490v2
Date: Wed, 02 Jul 2025 13:58:48 GMT
Status: Translation complete
Last updated in system: 2025-07-03 14:22:59.450436
Title: Just Noticeable Difference for Large Multimodal Models
Title (reference translation): Noteworthy Differences for Large Multimodal Models
Authors: Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang
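Abstract summary: Just noticeable difference (JND) is the minimum change that the human visual system (HVS) can perceive. As an initial attempt, the paper demonstrates that current LMMs have visual blind spots. The study highlights the significance of LMM-JND as a unique perspective for studying LMMs.
Reference score (independently computed attention score): 70.41467229325345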
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, few studies have systematically explored its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs have not been investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we make an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systematically quantify this characteristic, we propose a new concept, LMM-JND, together with its determination pipeline. To uncover behavioral commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation with their design philosophy that may guide the future refinement of LMMs' visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and a predictable LMM-JND is crucial for security concerns. This work will be available at https://github.com/zijianchen98/LMM-JND.
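Abstract (reference translation): Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Recent work has extended this line of research to machine vision, but few studies systematically explore perceptual boundaries across multiple tasks and stimulus types, especially in an era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become mainstream. In addition, the perceptual defects of LMMs have not been examined thoroughly, which leads to potential security issues and suboptimal response efficiency. As an initial attempt, this paper demonstrates that significant visual blind spots exist in current LMMs. To quantify this characteristic systematically, it proposes a new concept, LMM-JND, together with a determination pipeline. The authors examine several LMM families and construct a large-scale dataset, VPA-JND, containing 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. The paper further explores the effects of vision and language backbones and finds a notable correlation with their design philosophy that may guide the future refinement of LMMs' visual acuity. The research underscores the significance of LMM-JND as a unique perspective for studying LMMs and indicates that a predictable LMM-JND is crucial for security concerns. The work will be made available at https://github.com/zijianchen98/LMM-JND.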
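The abstract does not describe the LMM-JND determination pipeline in detail. As a rough illustration of the general JND idea only, and not the authors' method, the following Python sketch scans increasing distortion levels and returns the first level at which an observer reports a difference. The names `find_jnd`, `notices_difference`, and `toy_observer` are hypothetical, and the observer is a toy stand-in for an actual LMM comparison query.

```python
from typing import Callable, Optional, Sequence


def find_jnd(levels: Sequence[float],
             notices_difference: Callable[[float], bool]) -> Optional[float]:
    """Return the smallest distortion level the observer reports as different.

    `levels` is assumed to be sorted by increasing distortion strength.
    Returns None if the observer never notices a difference.
    """
    for level in levels:
        if notices_difference(level):
            return level
    return None


# Hypothetical stand-in for an LMM comparison query (assumption, not from
# the paper): this toy observer only notices distortions of strength 0.3
# or higher.
def toy_observer(level: float) -> bool:
    return level >= 0.3


levels = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
jnd = find_jnd(levels, toy_observer)
print(jnd)
```

In a real setting, `notices_difference` would wrap a query that shows the model a reference image and a distorted copy and asks whether they differ. A larger returned level would then indicate a wider blind spot for that distortion type.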

Related papers