Fugu-MT Paper Translation (Abstract): COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Paper summary: COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

arxiv url: http://arxiv.org/abs/2605.29628v1
Date: Thu, 28 May 2026 09:00:44 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.091391
Title: COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
Title (reference translation): COMET: Concept Space Partitioning of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
Authors: Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang
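Abstract summary: We introduce COMET, a partial least squares singular value decomposition (PLS-SVD) framework for CLAP. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation. We propose a simple spectral truncation method that mitigates the modality gap in a training-free manner.
Reference score (independently computed attention score): 17.01138431493397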
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
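Abstract (reference translation): Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and for supporting modality-agnostic condition swapping in many zero-shot applications. However, their performance is strongly affected by the modality gap between audio and text embeddings. Existing explanations attribute this gap mainly to the cone effect and treat it as a shift between mean embeddings, but correcting the mean alone brings only limited improvement. Alternative hypotheses such as information imbalance and dimensionality collapse have also been proposed, but they are insufficiently verified and have not been studied thoroughly in the audio domain. Meanwhile, several studies attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyzes the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a partial least squares singular value decomposition (PLS-SVD) framework for CLAP. Our framework shows that only a small, interpretable subset of axes capturing shared concepts contributes substantially to similarity computation, and that the mean component accounts for only part of the modality gap. Based on this finding, we propose a simple spectral truncation method that mitigates the modality gap without training. The method lets zero-shot audio captioning with condition swapping approach fully supervised performance without large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while maintaining strong performance on retrieval and audio captioning tasks.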
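
The abstract does not spell out the procedure, so the following is only a minimal sketch of textbook PLS-SVD followed by a top-k truncation, not the authors' implementation. It assumes PLS-SVD means taking the SVD of the cross-covariance matrix between centered, paired audio and text embeddings, and that spectral truncation means keeping the axes with the largest singular values. The function names `pls_svd` and `truncate`, the dimensions, and the synthetic data standing in for CLAP embeddings are all hypothetical.

```python
import numpy as np


def pls_svd(audio, text):
    """Textbook PLS-SVD: SVD of the cross-covariance of two paired embedding sets."""
    # Center each modality; the removed means are returned for later reuse.
    mu_a, mu_t = audio.mean(axis=0), text.mean(axis=0)
    a_c, t_c = audio - mu_a, text - mu_t
    # Cross-covariance matrix, shape (d_audio, d_text).
    cross_cov = a_c.T @ t_c / (audio.shape[0] - 1)
    # Columns of u and of vt.T are paired audio/text axes,
    # ordered by decreasing singular value.
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u, s, vt.T, mu_a, mu_t


def truncate(emb, mean, axes, k):
    """Project centered embeddings onto the top-k axes and L2-normalize the rows."""
    z = (emb - mean) @ axes[:, :k]
    return z / np.linalg.norm(z, axis=1, keepdims=True)


# Synthetic stand-in for paired embeddings (assumption: NOT real CLAP output).
# Both modalities share k latent "concepts" and carry opposite constant offsets.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 4
shared = rng.normal(size=(n, k))
audio = shared @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d)) + 2.0
text = shared @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d)) - 2.0

u, s, v, mu_a, mu_t = pls_svd(audio, text)
za = truncate(audio, mu_a, u, k)   # audio embeddings reduced to k dimensions
zt = truncate(text, mu_t, v, k)    # text embeddings reduced to k dimensions

# Share of the cross-covariance spectrum carried by the k retained axes.
energy = float(s[:k].sum() / s.sum())
print(f"top-{k} axes carry {energy:.3f} of the singular-value mass")
```

In this toy setup the shared structure has rank `k` by construction, so nearly all of the singular-value mass falls on the first `k` axes. How many axes matter for a real CLAP model, and how the paper chooses the cutoff, cannot be inferred from the abstract.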
