Fugu-MT Paper Translation (Abstract): Learning to Track Instance from Single Nature Language Description

Paper summary: Learning to Track Instance from Single Nature Language Description

arxiv url: http://arxiv.org/abs/2605.07064v1
Date: Fri, 08 May 2026 00:17:09 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.693538
Title: Learning to Track Instance from Single Nature Language Description
Title (reference translation): Learning to track instances from a single natural language description
Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Shuimu Zeng, Haiying Xia, Shuxiang Song
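Abstract summary: We introduce \tracker, a novel self-supervised vision-language tracker that can track any referred object from a language description. Experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.
Reference score (independently computed attention metric): 35.712922010701014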
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.
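Abstract (reference translation): How can vision-language (VL) tracking be achieved using natural language descriptions from a video sequence without relying on any bounding-box ground truth? This work achieves that goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that can track any referred object from a language description. Unlike conventional methods that fuse all language and visual tokens equally, we propose an efficient Dynamic Token Aggregation Module that treats each visual token unequally. The module consists of three main steps. i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, eliminating redundant visual-token noise and strengthening semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to learn instance tracking autonomously from unlabeled videos. This new modeling approach enables effective self-supervised learning of language-guided tracking representations without large-scale bounding-box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.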
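
The three steps of the Dynamic Token Aggregation Module described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. The function names, the dot-product similarity, the softmax weighting, and the additive fusion into the language tokens are all assumptions chosen for illustration, and tokens are plain Python lists rather than learned embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_target_tokens(template_tokens, anchor_index, k):
    """Step i (assumed form): keep the k template tokens most similar to the anchor."""
    anchor = template_tokens[anchor_index]
    scores = [dot(anchor, t) for t in template_tokens]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return top, [scores[i] for i in top]


def aggregate_into_language(language_tokens, template_tokens, top, top_scores):
    """Step ii (assumed form): merge the selected tokens by softmax weight,
    then add the merged vector to every language token."""
    weights = softmax(top_scores)
    dim = len(template_tokens[0])
    merged = [
        sum(w * template_tokens[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return [[l + m for l, m in zip(tok, merged)] for tok in language_tokens]


def extract_search_tokens(fused_language_tokens, search_tokens, k):
    """Step iii (assumed form): score each search token by its best match
    against the fused language tokens and keep the top k indices."""
    scores = [max(dot(q, s) for q in fused_language_tokens) for s in search_tokens]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data: 2-D tokens, hypothetical values.
template = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
language = [[0.5, 0.5]]
search = [[0.0, 1.0], [1.0, 0.0], [0.8, 0.2], [0.2, 0.8]]

top, top_scores = select_target_tokens(template, anchor_index=0, k=2)
fused = aggregate_into_language(language, template, top, top_scores)
candidates = extract_search_tokens(fused, search, k=2)
print(top, candidates)
```

In this toy setting the selected search-frame indices would be the tokens carried forward to the next frame as temporal prompts; the propagation itself is left out of the sketch.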

Related papers