Fugu-MT Paper Translation (Abstract): VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

Paper overview: VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

arxiv url: http://arxiv.org/abs/2603.20731v1
Date: Sat, 21 Mar 2026 09:33:41 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.061618
Title: VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation
Authors: Jun Du,
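Abstract summary: Existing multi-object tracking algorithms typically fail to adequately handle low-quality video. We propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). To address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module.
Reference score (independently computed attention score): 12.844814515209654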
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.
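The abstract names two mechanisms without giving formulas: distilling semantic features from a CLIP teacher into a student, and weighting the fusion of those features by frame quality. The sketch below is only an illustration of that general idea, not the paper's DCSD or DSWR. It assumes that the two constraints are a mean-squared-error term and a cosine-distance term, and that lower-quality frames should receive a larger semantic weight. All function names and parameters (`distillation_loss`, `fusion_weight`, `fuse`, `alpha`, `steepness`, `midpoint`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)


def distillation_loss(student, teacher, alpha=0.5):
    """Two-term loss pulling student features toward teacher features.

    ASSUMPTION: MSE plus cosine distance stands in for the paper's
    "dual constraint"; the abstract does not define the actual terms.
    """
    mse = sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    cos_dist = 1.0 - cosine_similarity(student, teacher)
    return alpha * mse + (1.0 - alpha) * cos_dist


def fusion_weight(quality, steepness=8.0, midpoint=0.5):
    """Map a frame-quality score in [0, 1] to a semantic weight in (0, 1).

    ASSUMPTION: lower quality yields a higher weight on semantic features,
    modeled here with a decreasing logistic curve.
    """
    return 1.0 / (1.0 + math.exp(steepness * (quality - midpoint)))


def fuse(tracker_feat, semantic_feat, quality):
    """Blend tracker and distilled semantic features by frame quality."""
    w = fusion_weight(quality)
    return [(1.0 - w) * f + w * s for f, s in zip(tracker_feat, semantic_feat)]


if __name__ == "__main__":
    teacher = [0.2, 0.4, 0.4]
    student = [0.1, 0.5, 0.3]
    print("distillation loss:", distillation_loss(student, teacher))
    print("weight for blurry frame:", fusion_weight(0.1))
    print("weight for sharp frame:", fusion_weight(0.9))
```

In this sketch, a degraded frame leans more on the distilled semantic features and a clean frame leans more on the tracker's own features. This is one plausible reading of "adaptively allocates fusion weights based on real-time frame quality assessment".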

Related papers