Fugu-MT Paper Translation (Abstract): Human-like Object Grouping in Self-supervised Vision Transformers

Paper summary: Human-like Object Grouping in Self-supervised Vision Transformers

arxiv url: http://arxiv.org/abs/2603.13994v1
Date: Sat, 14 Mar 2026 15:43:10 GMT
Status: translation complete
System update date: 2026-03-17 16:19:35.530864
Title: Human-like Object Grouping in Self-supervised Vision Transformers
Authors: Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky
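Abstract summary: This paper proposes a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes. A diverse set of vision models is tested, using a simple readout from their representations to predict subjects' reaction times. The results show that self-supervised vision models capture object structure in a behaviorally human-like manner and that Gram matrix structure plays a role in driving perceptual alignment.
Reference score (independently computed attention score): 9.933177928703172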
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
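The abstract mentions two related quantities: the Gram matrix, which captures similarity structure across image patches, and a metric that compares patch similarity within and between objects. The abstract does not give the exact definitions, so the following is only a minimal sketch under stated assumptions: cosine similarity as the patch similarity measure, and "mean within-object similarity minus mean between-object similarity" as the score. The function names `cosine_gram` and `object_centric_score` and the toy feature values are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine_gram(patches):
    """Return the cosine-similarity Gram matrix for a list of patch feature vectors.

    Assumption: cosine similarity is used; the paper may use a different measure.
    """
    norms = [math.sqrt(sum(x * x for x in p)) for p in patches]
    n = len(patches)
    return [
        [
            sum(a * b for a, b in zip(patches[i], patches[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def object_centric_score(patches, labels):
    """Mean within-object similarity minus mean between-object similarity.

    Assumption: this difference of means is an illustrative stand-in for the
    paper's metric, whose exact form is not stated in the abstract.
    """
    gram = cosine_gram(patches)
    within, between = [], []
    for i, j in combinations(range(len(patches)), 2):
        # Pairs of patches on the same object go to `within`, all others to `between`.
        (within if labels[i] == labels[j] else between).append(gram[i][j])
    if not within or not between:
        raise ValueError("need at least one same-object and one different-object pair")
    return sum(within) / len(within) - sum(between) / len(between)


# Toy example with hypothetical features: four patches, two per object.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
score = object_centric_score(patches, labels)
print(f"object-centric score: {score:.3f}")
```

A larger positive score means patches on the same object are more similar to each other than to patches on other objects, which is the kind of object-centric structure the abstract relates to human segmentation behavior. In practice the patch features would come from a vision model and the labels from segmentation masks.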
