Fugu-MT Paper Translation (Abstract): Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

Paper overview: Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

arxiv url: http://arxiv.org/abs/2506.20381v1
Date: Wed, 25 Jun 2025 12:46:46 GMT
Status: Translation complete
Last updated in system: 2025-06-26 21:00:42.738743
Title: Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
Title (reference translation): Exploiting a Lightweight Hierarchical ViT and a Dynamic Framework for Efficient Visual Tracking
Authors: Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu
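Abstract summary: Transformer-based visual trackers have made significant progress thanks to their powerful modeling capabilities. However, their slow processing speeds limit their practicality on resource-constrained devices. We present HiT, a family of efficient tracking models that achieve high performance while maintaining fast operation across various devices.
Reference score (independently computed attention score): 49.07982079554859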
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.
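Abstract (reference translation): Transformer-based visual trackers have made significant progress thanks to their powerful modeling capabilities. However, their slow processing speeds limit their practicality on resource-constrained devices. To address this challenge, we propose HiT, a new family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT is its Bridge Module, which connects lightweight transformers to the tracking framework and enhances feature representation quality. In addition, we propose a dual-image position encoding approach to encode spatial information effectively. HiT runs at 61 frames per second (fps) on the NVIDIA Jetson AGX platform and achieves a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that adapts flexibly to scene complexity by selecting routes with different computational requirements. DyHiT feeds the search area features extracted by the backbone network into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy and selects appropriate routes to achieve a better trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For example, our acceleration method gives the state-of-the-art tracker SeqTrack-B256 a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.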
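Illustrative note: the abstract describes DyHiT's dynamic routing only at a high level (a router classifies the tracking scenario from search area features, then a route with a matching computational cost is selected). The Python sketch below shows that general pattern under stated assumptions. It is not the authors' implementation: the names `toy_router`, `fast_route`, `full_route`, and `track_step`, the averaging rule, and the 0.5 threshold are all hypothetical placeholders.

```python
from typing import Callable, List, Tuple

Features = List[float]


def toy_router(features: Features) -> float:
    """Hypothetical router: maps features to an 'easy scene' score in [0, 1].

    Assumption: a real router would be a small learned classifier; the mean
    of the features is used here only to keep the sketch self-contained.
    """
    mean = sum(features) / len(features)
    return max(0.0, min(1.0, mean))


def fast_route(features: Features) -> str:
    """Placeholder for a cheap route (for example, an early prediction head)."""
    return "box from cheap route"


def full_route(features: Features) -> str:
    """Placeholder for the expensive route (the full remaining computation)."""
    return "box from full route"


def track_step(
    features: Features,
    router: Callable[[Features], float] = toy_router,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[str, str]:
    """Pick a route per frame: easy scenes take the cheap route."""
    score = router(features)
    if score >= threshold:
        return "fast", fast_route(features)
    return "full", full_route(features)


if __name__ == "__main__":
    print(track_step([0.9, 0.8]))  # high score, cheap route
    print(track_step([0.1, 0.2]))  # low score, full route
```

The design point is that the routing decision costs far less than the work it may skip, so frames judged easy finish early while hard frames still receive the full computation.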

Related papers