Fugu-MT Paper Translation (Abstract): Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting

Paper overview: Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting

arxiv url: http://arxiv.org/abs/2512.14115v1
Date: Tue, 16 Dec 2025 05:58:25 GMT
Status: Translation complete
Last updated in system: 2025-12-17 16:49:26.611039
Title: Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting
Authors: Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K
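Abstract summary: This work proposes a multimodal contrastive learning framework that unifies acoustic and cross-modal supervision in a shared embedding space. It simultaneously optimizes (i) audio-text contrastive learning, inspired by the CLAP loss, and (ii) audio-audio contrastive learning via the Deep Word Discrimination (DWD) loss, which enhances intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS.
Reference score (attention score computed independently by the site): 13.48022380380599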
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
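
The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the DWD loss, so the audio-audio term below is a generic symmetric InfoNCE loss used as a stand-in. The function names, the `weight` and `temperature` parameters, and the batch layout (row i of every list belongs to the same word, and all words in a batch are distinct) are assumptions made for illustration only.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive(x, y, temperature=0.07):
    # CLAP-style symmetric cross-entropy: x[i] and y[i] form the positive
    # pair, every other row in the batch acts as a negative.
    x = [l2_normalize(v) for v in x]
    y = [l2_normalize(v) for v in y]
    n = len(x)
    logits = [[sum(a * b for a, b in zip(u, w)) / temperature for w in y]
              for u in x]
    x_to_y = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    y_to_x = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (x_to_y + y_to_x)

def joint_loss(audio_a, audio_b, text, weight=1.0, temperature=0.07):
    # audio_a[i], audio_b[i]: embeddings of two utterances of word i.
    # text[i]: embedding of the written form of word i.
    # Assumption: the audio-audio term is a stand-in for the DWD loss.
    audio_text = symmetric_contrastive(audio_a, text, temperature)
    audio_audio = symmetric_contrastive(audio_a, audio_b, temperature)
    return audio_text + weight * audio_audio

# Toy batch of two words with 2-dimensional embeddings.
audio_a = [[1.0, 0.0], [0.0, 1.0]]
audio_b = [[0.9, 0.1], [0.1, 0.9]]
text = [[1.0, 0.1], [0.1, 1.0]]
aligned = joint_loss(audio_a, audio_b, text)
mismatched = joint_loss(audio_a, audio_b[::-1], text[::-1])
print(aligned < mismatched)
```

Because both terms are computed on the same audio embeddings and summed into one loss, a single encoder is pushed toward a space where spoken and written forms of a word lie close together, which is what would let one model serve both audio queries (STD) and text queries (KWS).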

Related papers