Fugu-MT Paper Translation (Abstract): MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Paper overview: MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

arxiv url: http://arxiv.org/abs/2606.25225v1
Date: Tue, 23 Jun 2026 22:48:42 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:05:30.165168
Title: MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning
Authors: Revant Teotia, Adrien Bardes, Michael Rabbat, Sumit Chopra, Matthew J. Muckley, Nicolas Ballas
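Abstract summary: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Existing methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities.
Reference score (attention score computed independently by this site): 15.707226798418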
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
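The abstract describes a single shared encoder trained with one predictive objective applied both within and across modalities. The following is a toy sketch of that idea only, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the linear "encoder" and "predictor", the mean pooling, the L1 distance, the frozen copy used as target encoder, the context/target split, and all function names (`encode`, `predictive_loss`, `mjepa_style_loss`). The abstract does not specify any of these details.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(W, tokens):
    """Embed each token with W, then mean-pool into one vector."""
    embedded = [matvec(W, t) for t in tokens]
    return [sum(col) / len(embedded) for col in zip(*embedded)]

def predictive_loss(W_enc, W_tgt, W_pred, context, target):
    """Mean L1 distance between the predicted and the actual target embedding."""
    predicted = matvec(W_pred, encode(W_enc, context))
    actual = encode(W_tgt, target)  # treated as a constant in this sketch
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mjepa_style_loss(W_enc, W_tgt, W_pred, audio, video, split):
    """Sum the same objective over every (source, destination) modality pair."""
    context = {"audio": audio[:split], "video": video[:split]}
    target = {"audio": audio[split:], "video": video[split:]}
    terms = {
        (src, dst): predictive_loss(W_enc, W_tgt, W_pred, context[src], target[dst])
        for src in context
        for dst in target
    }
    return sum(terms.values()), terms

def rand_matrix(rng, rows, cols):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy data (assumption): 6 audio and 6 video tokens already projected to width 4.
rng = random.Random(0)
dim = 4
audio = rand_matrix(rng, 6, dim)
video = rand_matrix(rng, 6, dim)

W_enc = rand_matrix(rng, dim, dim)   # one encoder shared by both modalities
W_tgt = [row[:] for row in W_enc]    # target encoder: frozen copy (assumption)
W_pred = rand_matrix(rng, dim, dim)  # single linear predictor (assumption)

total, terms = mjepa_style_loss(W_enc, W_tgt, W_pred, audio, video, split=3)
for (src, dst), value in terms.items():
    kind = "within-modality" if src == dst else "cross-modal"
    print(f"{src} -> {dst} ({kind}): {value:.4f}")
print(f"total: {total:.4f}")
```

The two terms with matching source and destination are the within-modality predictions; the other two are the cross-modal predictions that the abstract reports as critical. Dropping the cross-modal keys from `terms` would give the shared-encoder variant that, per the abstract, degrades below unimodal baselines.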

Related papers