Fugu-MT Paper Translation (Abstract): AudioMosaic: Contrastive Masked Audio Representation Learning

Paper overview: AudioMosaic: Contrastive Masked Audio Representation Learning

arxiv url: http://arxiv.org/abs/2605.14231v1
Date: Thu, 14 May 2026 00:56:51 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.557898
Title: AudioMosaic: Contrastive Masked Audio Representation Learning
Title (reference translation): AudioMosaic: Contrastive Masked Audio Representation Learning
Authors: Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani
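Abstract summary: We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches. Experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks.
Reference score (independently computed attention score): 53.52371029884106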
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our GitHub repository (https://github.com/HanxunH/AudioMosaic).
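Abstract (reference translation): Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. Recent progress has been driven mainly by generative reconstruction objectives, while contrastive approaches remain less explored, owing to the difficulty of designing effective audio augmentations and the large batch sizes that contrastive pre-training requires. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic builds positive pairs by applying structured time-frequency masking to spectrogram patches. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that transfer strongly across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is available in our GitHub repository (https://github.com/HanxunH/AudioMosaic).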
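
The abstract describes building positive pairs by masking spectrogram patches and training contrastively. The following is a minimal sketch of that general idea, not the authors' implementation: the band-shaped masking scheme, the mean-pooling "encoder", the one-directional InfoNCE loss, and every name and size here (structured_mask, encode, info_nce, the 8x4 patch grid) are assumptions made for illustration. The abstract does not specify any of these details; see the authors' repository for the actual method.

```python
import math
import random


def structured_mask(n_time, n_freq, time_width, freq_width, rng):
    """Hypothetical masking: drop one contiguous band of time steps and one
    contiguous band of frequency bins; return the (t, f) patches that remain."""
    t0 = rng.randrange(n_time - time_width + 1)
    f0 = rng.randrange(n_freq - freq_width + 1)
    return [
        (t, f)
        for t in range(n_time)
        for f in range(n_freq)
        if not (t0 <= t < t0 + time_width) and not (f0 <= f < f0 + freq_width)
    ]


def encode(patches, kept):
    """Toy stand-in for an encoder: mean-pool the visible patch vectors and
    L2-normalise. Only the kept patches are ever read."""
    dim = len(patches[kept[0]])
    pooled = [sum(patches[idx][d] for idx in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def info_nce(views_a, views_b, temperature=0.1):
    """One-directional InfoNCE: view i in views_a should match view i in
    views_b; all other rows of views_b act as negatives."""
    loss = 0.0
    for i, a in enumerate(views_a):
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in views_b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        loss += log_denominator - logits[i]
    return loss / len(views_a)


# Assumed toy sizes: an 8x4 grid of 16-dimensional patch vectors, batch of 4 clips.
rng = random.Random(0)
N_TIME, N_FREQ, DIM, BATCH = 8, 4, 16, 4
batch = [
    {(t, f): [rng.gauss(0, 1) for _ in range(DIM)]
     for t in range(N_TIME) for f in range(N_FREQ)}
    for _ in range(BATCH)
]

views_a, views_b = [], []
for patches in batch:
    # Two independently masked views of the same clip form a positive pair.
    kept_a = structured_mask(N_TIME, N_FREQ, 4, 2, rng)
    kept_b = structured_mask(N_TIME, N_FREQ, 4, 2, rng)
    views_a.append(encode(patches, kept_a))
    views_b.append(encode(patches, kept_b))

loss = info_nce(views_a, views_b)
kept_count = len(kept_a)
print(f"patches seen per view: {kept_count} of {N_TIME * N_FREQ}, loss = {loss:.4f}")
```

In this sketch each view passes only (8 - 4) x (4 - 2) = 8 of the 32 patches to the encoder, which illustrates why masking can lower per-sample memory and leave room for larger batches, as the abstract states.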
