Fugu-MT Paper Translation (Abstract): ViT-Up: Faithful Feature Upsampling for Vision Transformers

Paper abstract: ViT-Up: Faithful Feature Upsampling for Vision Transformers

arxiv url: http://arxiv.org/abs/2606.14024v1
Date: Fri, 12 Jun 2026 01:55:17 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.705341
Title: ViT-Up: Faithful Feature Upsampling for Vision Transformers
Authors: Krispin Wandel, Jingchuan Wang, Hesheng Wang
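Abstract summary: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning. ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention. We introduce ViT-Up, a framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states.
Reference score (attention metric computed independently by this site): 19.84545321813943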
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
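The quadratic cost mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not from the paper: the patch size of 16 and the image sizes are assumptions chosen for illustration, the function names are hypothetical, and extra tokens such as class or register tokens are ignored. It counts the patch tokens for a square image and the number of pairwise attention scores a global self-attention layer would compute per head.

```python
def token_grid(height, width, patch_size):
    """Return (rows, cols) of the patch-token grid for an image.

    Assumes non-overlapping square patches; any remainder is dropped.
    """
    return height // patch_size, width // patch_size


def attention_pairs(height, width, patch_size):
    """Number of query-key pairs in one global self-attention map.

    With N patch tokens the attention map has N * N entries, which is
    the quadratic term the abstract refers to.
    """
    rows, cols = token_grid(height, width, patch_size)
    n_tokens = rows * cols
    return n_tokens * n_tokens


if __name__ == "__main__":
    patch_size = 16  # assumed for illustration, not stated in the abstract
    for side in (224, 448, 896):
        rows, cols = token_grid(side, side, patch_size)
        pairs = attention_pairs(side, side, patch_size)
        print(f"{side}x{side} image -> {rows}x{cols} tokens, {pairs} attention pairs")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the attention map size by 16, which is why backbones are usually run on coarse grids and dense features are recovered afterwards by an upsampler.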
