Fugu-MT Paper Translation (Abstract): Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Paper overview: Finding Distributed Object-Centric Properties in Self-Supervised Transformers

arxiv url: http://arxiv.org/abs/2603.26127v1
Date: Fri, 27 Mar 2026 07:22:04 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.383014
Title: Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja
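Abstract summary: Self-supervised Vision Transformers (ViTs) show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps localize objects poorly because the [CLS] token, trained on an image-level objective, summarizes the whole image instead of focusing on objects. The authors propose Object-DINO, a training-free method that extracts the distributed object-centric information.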
Reference score (independently calculated attention score): 59.00547715011873
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
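The two analysis steps named in the abstract, inter-patch similarity from per-head q, k, v features and clustering of heads across layers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: cosine similarity, plain k-means with farthest-point initialisation, the function names, and the toy data are all assumptions made for illustration. The abstract does not say how the object-centric cluster is selected, so that step is omitted.

```python
import numpy as np


def patch_similarity(feats):
    """Cosine similarity between all patch pairs; feats is (num_patches, dim)."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T


def head_descriptors(qkv):
    """Flatten the q, k and v similarity maps of each head into one vector.

    qkv: (num_heads, 3, num_patches, dim), with heads pooled over all layers.
    """
    return np.stack([
        np.concatenate([patch_similarity(comp).ravel() for comp in head])
        for head in qkv
    ])


def cluster_heads(desc, k, iters=10):
    """Plain k-means with deterministic farthest-point initialisation.

    Assumption: the paper's actual clustering procedure may differ.
    """
    centers = [desc[0]]
    for _ in range(k - 1):
        # Squared distance from every head to its nearest chosen center.
        d = np.min([((desc - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(desc[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = desc[labels == j].mean(axis=0)
    return labels


# Toy data (hypothetical): 6 heads, 8 patches, 4-dim features per component.
rng = np.random.default_rng(0)
basis = np.eye(4)
halves = basis[[0, 0, 0, 0, 1, 1, 1, 1]]  # patches grouped into two halves
parity = basis[[0, 1, 0, 1, 0, 1, 0, 1]]  # patches grouped by even/odd index
qkv = np.stack([
    np.stack([pattern + 0.01 * rng.standard_normal((8, 4)) for _ in range(3)])
    for pattern in [halves, halves, halves, parity, parity, parity]
])
labels = cluster_heads(head_descriptors(qkv), k=2)
print(labels)
```

In this toy setup, heads whose patches group the same way produce nearly identical similarity maps and land in the same cluster. With a real model, `qkv` would instead be filled from the per-head query, key, and value projections of every layer.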

Related papers