Fugu-MT Paper Translation (Abstract): S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Paper overview: S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

arxiv url: http://arxiv.org/abs/2604.24933v1
Date: Mon, 27 Apr 2026 19:20:47 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.568061
Title: S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
Title (reference translation): S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
Authors: Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid
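Abstract summary: S-SONDO is the first framework to distill general audio models using only their output embeddings. Its effectiveness is demonstrated by distilling two audio foundation models into three efficient students.
Reference score (independently computed attention score): 24.103531000455003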
License: http://creativecommons.org/licenses/by/4.0/
Abstract: General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
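Abstract (reference translation): General audio foundation models have recently made remarkable progress and achieve strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, which leads to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has focused mainly on supervised settings that rely on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. Because it requires neither logits nor layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available at https://github.com/MedAliAdlouni/ssondo.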
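
The abstract describes distillation that uses only the teacher's output embeddings, with no logits and no layer-level alignment. A minimal sketch of that idea follows. It is not the authors' implementation: the choice of cosine distance as the loss, the function names `cosine_distance` and `embedding_distillation_loss`, and the assumption that student and teacher embeddings share the same dimension are all illustrative assumptions (the abstract only says that loss choice is studied, not which loss is used).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def embedding_distillation_loss(student_batch, teacher_batch):
    """Mean cosine distance between paired student and teacher embeddings.

    Hypothetical helper: only output embeddings are compared, so the
    teacher can be treated as a black box. Assumes both models emit
    vectors of the same dimension.
    """
    if len(student_batch) != len(teacher_batch):
        raise ValueError("student and teacher batches must have equal size")
    distances = [
        cosine_distance(s, t) for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(distances) / len(distances)


# Toy embeddings for two audio clips (made-up numbers, for illustration only).
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions as the teacher
orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # perpendicular to the teacher

loss_aligned = embedding_distillation_loss(aligned, teacher)
loss_orthogonal = embedding_distillation_loss(orthogonal, teacher)
```

In this sketch a student whose embeddings point in the same direction as the teacher's incurs a lower loss than one whose embeddings are perpendicular, regardless of vector length. When the student's embedding size differs from the teacher's, a learned projection would be needed before the comparison; that step is omitted here.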

Related papers