Fugu-MT Paper Translation (Abstract): What You Have is What You Track: Adaptive and Robust Multimodal Tracking

Paper overview: What You Have is What You Track: Adaptive and Robust Multimodal Tracking

arxiv url: http://arxiv.org/abs/2507.05899v1
Date: Tue, 08 Jul 2025 11:40:21 GMT
Status: Translation complete
Last updated in system: 2025-07-09 16:34:37.949016
Title: What You Have is What You Track: Adaptive and Robust Multimodal Tracking
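Title (reference translation): Adaptive and Robust Multimodal Tracking
Authors: Yuedong Tan, Jiawei Shao, Eduard Zamfir, Ruanjun Li, Zhaochong An, Chao Ma, Danda Paudel, Luc Van Gool, Radu Timofte, Zongwei Wu
Abstract summary: This work presents a comprehensive study of tracker performance with temporally incomplete multimodal data. The proposed model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings.
Reference score (attention score computed independently by the site): 72.92244578461869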
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
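Abstract (reference translation): Multimodal data is known to help visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability. Despite its importance, this area remains underexplored. This paper presents the first comprehensive study of tracker performance with temporally incomplete multimodal data. Under such conditions, existing trackers show significant performance degradation because their rigid architectures lack the adaptability needed to handle missing modalities effectively. To address these limitations, we propose a flexible framework for robust multimodal tracking. A tracker should dynamically activate computational units based on the missing data rate. This is achieved by combining a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity and a video-level masking strategy that ensures both the temporal consistency and the spatial completeness essential for effective video tracking. Surprisingly, our model adapts not only to varying missing rates but also to scene complexity. It achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be made available at https://github.com/supertyd/FlexTrack/tree/main.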
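The abstract mentions a video-level masking strategy for temporally incomplete multimodal data but gives no implementation details. The following is only a minimal sketch of one possible reading, not the authors' method: for a given missing rate, whole frames of an auxiliary modality are dropped, so every frame is either fully present or fully absent. The function names, the `placeholder` value, and the uniform random choice of missing frames are all hypothetical.

```python
import random


def sample_frame_availability(num_frames, missing_rate, seed=None):
    """Return one boolean per frame: True if the auxiliary modality is present.

    Hypothetical helper: missing frames are picked uniformly at random,
    which is an assumption and not taken from the paper.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = random.Random(seed)
    num_missing = round(missing_rate * num_frames)
    missing = set(rng.sample(range(num_frames), num_missing))
    return [index not in missing for index in range(num_frames)]


def apply_video_level_mask(aux_frames, available, placeholder=None):
    """Replace each unavailable frame as a whole with a placeholder.

    Frames are never partially masked, so each frame stays spatially complete.
    """
    return [
        frame if keep else placeholder
        for frame, keep in zip(aux_frames, available)
    ]


# Usage: drop 30% of the frames of a 10-frame auxiliary stream.
aux_frames = [f"thermal_{t}" for t in range(10)]
available = sample_frame_availability(len(aux_frames), missing_rate=0.3, seed=0)
masked = apply_video_level_mask(aux_frames, available)
```

A tracker evaluated on such data would receive the primary modality for every frame and the auxiliary modality only where `available` is true, which is the kind of setting the abstract calls temporally incomplete.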

Related papers