Fugu-MT Paper Translation (Abstract): Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

Paper summary: Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

arxiv url: http://arxiv.org/abs/2605.17270v1
Date: Sun, 17 May 2026 05:40:55 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.820621
Title: Beyond Detection: A Structure-Aware Framework for Scene Text Tracking
Title (reference translation): Beyond Detection: A Structure-Aware Framework for Scene Text Tracking
Authors: Chenmin Yu, Liu Yu, Daiqing Wu, Gengluo Li, Zeyu Chen, Yu Zhou
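Abstract summary: This paper formalizes this specific task as Scene Text Tracking. To address it, we propose SymTrack, a unified detection-free framework with a synergistic dual-branch design. Because no dedicated benchmark exists for this task, we use three datasets from video text spotting to construct a benchmark with high-quality annotations.
Reference score (independently computed attention score): 15.940149796254955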
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with a synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets a new state of the art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on $\text{BOVText}_{\text{SOT}}$. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.
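Abstract (reference translation): Modern visual object trackers show impressive results on general targets, but their performance drops substantially when handling scene text. Although still underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work on it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias and a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Given that no dedicated benchmark exists for this task, we use three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments show that SymTrack sets a new state of the art on all three benchmarks, outperforming the previous best trackers by up to 11.97% AUC on $\text{BOVText}_{\text{SOT}}$. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.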


Related papers