Fugu-MT Paper Translation (Summary): Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

Paper summary: Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

arxiv url: http://arxiv.org/abs/2604.08147v1
Date: Thu, 09 Apr 2026 12:08:40 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:05.901831
Title: Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
Title (reference translation): Reducing Semantic Noise via Teacher-Guided Dual-Path Video Representation Learning
Authors: Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang
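Abstract summary: TG-DP is a teacher-guided dual-path framework that decouples reconstruction and alignment into separate optimization paths. TG-DP achieves state-of-the-art performance in zero-shot retrieval.
Reference score (independently computed attention score): 33.481995795091045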
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
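Abstract (reference translation): Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches rather than on a pattern suited to cross-modal alignment, which introduces semantic noise and optimization interference. This paper proposes TG-DP, a teacher-guided dual-path framework that decouples reconstruction and alignment into separate optimization paths. By separating the masking regimes of the two branches, TG-DP lets the contrastive pathway use a visibility pattern suited to cross-modal alignment. A teacher model provides auxiliary guidance for organizing the visible tokens in this branch. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations are also semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. The results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway is an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.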
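
Note on the retrieval metric: the abstract reports R@1 (recall at rank 1), and the stated gains amount to +2.2 percentage points for video-to-audio and +9.2 points for audio-to-video retrieval. The following is a minimal sketch of how recall@k is commonly computed from a query-by-candidate similarity matrix, assuming the correct match for query i is candidate i. The function name `recall_at_k` and the toy similarity values are illustrative assumptions and are not taken from the paper or its code release.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose true match is among the top-k candidates.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the true match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


# Toy 3x3 similarity matrix with made-up values (not data from the paper).
toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: best candidate is 0 (hit)
    [0.2, 0.4, 0.8],  # query 1: best candidate is 2 (miss at k=1)
    [0.1, 0.2, 0.7],  # query 2: best candidate is 2 (hit)
]

r1 = recall_at_k(toy_sim, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(toy_sim, k=2)  # query 1 is recovered at rank 2
```

In a video-to-audio setting the rows would be video queries and the columns audio candidates. Transposing the matrix gives the audio-to-video direction, which is why the two directions can score differently.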

Related papers