Fugu-MT Paper Translation (Abstract): Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Paper abstract: Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

arxiv url: http://arxiv.org/abs/2605.14031v1
Date: Wed, 13 May 2026 18:45:08 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.461859
Title: Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
Title (reference translation): Do Masked Autoencoders Work with Limited Data? A Case Study in Fine-Grained Bioacoustics
Authors: Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn
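Abstract summary: Masked autoencoders (MAE) show strong transferability on massive audio corpora. The authors conduct a systematic study of MAE pretraining for species classification on iNatSounds. The results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design.
Reference score (independently computed attention measure): 20.469464200788583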
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.
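Abstract (reference translation): Bioacoustic recognition requires fine-grained acoustic understanding to distinguish between similar-sounding species. However, many large-scale data repositories such as iNaturalist are only weakly annotated, often with a single positive species label per recording, which makes supervised learning particularly difficult. Inspired by advances in computer vision, recent approaches have moved toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotation. In particular, masked autoencoders (MAE) show strong transferability on massive audio corpora, but their effectiveness in more modest bioacoustic settings remains underexplored. This work presents a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the effects of pretraining data scale, domain specificity, data curation, and transfer strategy. Consistent with prior work, models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. In contrast to observations from large-scale audio benchmarks, (1) additional masked reconstruction pretraining on domain-specific data brings only limited benefit and can even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. The results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.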

Related papers