Fugu-MT Paper Translation (Summary): Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

Paper summary: Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

arxiv url: http://arxiv.org/abs/2510.11330v1
Date: Mon, 13 Oct 2025 12:25:33 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.356876
Title: Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
Title (reference translation): Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
Authors: KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
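Abstract summary: Diffusion-Link is a diffusion-based modality-bridging module. It maps audio embeddings into the text-embedding distribution. According to the authors, this is the first application of diffusion-based modality bridging to Automatic Audio Captioning.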
Reference score (attention score computed independently by this site): 36.21722709167031
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link
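Abstract (reference translation): Contrastive audio-language pretraining yields powerful joint representations, but a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of a frozen multimodal encoder and is implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC). We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap more than prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art results on AudioCaps in both zero-shot and fully supervised captioning. These results indicate that closing the modality gap is important for effective coupling between multimodal encoders and LLMs. https://github.com/DevKiHyun/Diffusion-Link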
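
The abstract describes the bridging module only at a high level: a lightweight network with three residual MLP blocks that operates on the frozen encoder's output embeddings. The following is a minimal NumPy sketch of what such a module could look like, not the authors' implementation. Everything beyond "three residual MLP blocks" is an assumption made for illustration: the DDPM-style linear noise schedule, the sinusoidal timestep conditioning, the ReLU activation, the clean-embedding prediction target, the final L2 normalization, the dimensions, and all function names. The weights are random and untrained, so the output only demonstrates shapes and data flow.

```python
import numpy as np

def make_alpha_bar(num_steps):
    # Assumed linear beta schedule (standard DDPM choice, not from the paper).
    betas = np.linspace(1e-4, 0.02, num_steps)
    return np.cumprod(1.0 - betas)

def timestep_embedding(t, dim):
    # Assumed sinusoidal conditioning; dim must be even.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def make_block(rng, dim, hidden):
    # Randomly initialized parameters for one residual MLP block.
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)), "b2": np.zeros(dim),
    }

def residual_mlp_block(x, p):
    # Two-layer MLP with ReLU, added back onto the input (skip connection).
    h = np.maximum(x @ p["w1"] + p["b1"], 0.0)
    return x + h @ p["w2"] + p["b2"]

def denoiser(x_t, t, blocks):
    # Add the timestep embedding, then apply the residual MLP blocks in order.
    h = x_t + timestep_embedding(t, x_t.shape[-1])
    for p in blocks:
        h = residual_mlp_block(h, p)
    return h

def forward_noise(x0, t, alpha_bar, rng):
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def bridge(audio_emb, t, blocks, alpha_bar, rng):
    # Hypothetical inference path: noise the audio embedding to step t,
    # predict a clean text-like embedding in one shot, then L2-normalize.
    x_t, _ = forward_noise(audio_emb, t, alpha_bar, rng)
    x0_hat = denoiser(x_t, t, blocks)
    return x0_hat / np.linalg.norm(x0_hat, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim, hidden, num_steps = 8, 16, 100          # toy sizes, chosen arbitrarily
blocks = [make_block(rng, dim, hidden) for _ in range(3)]  # three blocks, per the abstract
alpha_bar = make_alpha_bar(num_steps)
audio_emb = rng.standard_normal((4, dim))    # stand-in for frozen-encoder audio embeddings
text_like = bridge(audio_emb, t=10, blocks=blocks, alpha_bar=alpha_bar, rng=rng)
print(text_like.shape)  # (4, 8)
```

In a trained version of such a module, the bridged embeddings would be passed to the LLM in place of the raw audio embeddings; the training objective and sampling procedure are not specified in the abstract and are left out here.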

Related papers