Fugu-MT Paper Translation (Abstract): Autodecompose: A generative self-supervised model for semantic decomposition

Paper summary: Autodecompose: A generative self-supervised model for semantic decomposition

arxiv url: http://arxiv.org/abs/2302.03124v1
Date: Mon, 6 Feb 2023 21:18:09 GMT
Status: Translation complete
Last updated in system: 2023-02-08 18:13:18.578994
Title: Autodecompose: A generative self-supervised model for semantic decomposition
Title (reference translation): Autodecompose: A generative self-supervised model for semantic decomposition
Authors: Mohammad Reza Bonyadi
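Abstract summary: Autodecompose is a self-supervised generative model that decomposes data into two semantically independent properties. Autodecompose is applied to audio signals to encode the sound source (human voice) and the content. Autodecompose is shown to be robust against overfitting even when a large model is pre-trained on a small dataset.
Reference score (independently computed attention measure): 1.5990720051907859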
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Autodecompose, a novel self-supervised generative model that decomposes data into two semantically independent properties: the desired property, which captures a specific aspect of the data (e.g. the voice in an audio signal), and the context property, which aggregates all other information (e.g. the content of the audio signal), without any labels given. Autodecompose uses two complementary augmentations, one that manipulates the context while preserving the desired property and the other that manipulates the desired property while preserving the context. The augmented variants of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property while the other embeds the context property. We apply Autodecompose to audio signals to encode sound source (human voice) and content. We pre-trained the model on YouTube and LibriSpeech datasets and fine-tuned it in a self-supervised manner without exposing the labels. Our results showed that, using the sound source encoder of pre-trained Autodecompose, a linear classifier achieves an F1 score of 97.6% in recognizing the voice of 30 speakers using only 10 seconds of labeled samples, compared to 95.7% for supervised models. Additionally, our experiments showed that Autodecompose is robust against overfitting even when a large model is pre-trained on a small dataset. A large Autodecompose model pre-trained from scratch on 60 seconds of audio from 3 speakers achieved an F1 score of over 98.5% in recognizing those three speakers in other, unseen utterances. We finally show that the context encoder embeds information about the content of the speech and ignores the sound source information. Our sample code for training the model, as well as examples for using the pre-trained models, are available here: https://github.com/rezabonyadi/autodecompose
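Abstract (reference translation): We introduce Autodecompose, a novel self-supervised generative model that decomposes data into two semantically independent properties: the desired property, which captures a specific aspect of the data (e.g. the voice in an audio signal), and the context property, which aggregates all other information (e.g. the content of the audio signal) without labels. Autodecompose uses two complementary augmentations, one of which manipulates the context while preserving the desired property. The augmented versions of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property and the other embeds the context property. We apply Autodecompose to audio signals to encode the sound source (human voice) and the content. The model was pre-trained on the YouTube and LibriSpeech datasets and fine-tuned in a self-supervised manner without exposing the labels. Using the sound source encoder of the pre-trained Autodecompose, an F1 score of 97.6% was achieved in recognizing the voices of 30 speakers from 10 seconds of labeled samples, compared with 95.7% for supervised models. Furthermore, Autodecompose was shown to be robust against overfitting even when a large model is pre-trained on a small dataset. A model pre-trained from scratch on 60 seconds of audio from three speakers achieved an F1 score of 98.5% in recognizing those three speakers in other, unseen utterances. Finally, we show that the context encoder embeds information about the content of the speech and ignores the sound source information. The sample code for training the model and examples of using the pre-trained models are available at https://github.com/rezabonyadi/autodecompose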
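
The data flow described in the abstract (two complementary augmentations, two encoders, one decoder) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration: a sample is modeled as a plain (desired, context) pair rather than audio, the encoders and decoder are hand-written stand-ins rather than trained networks, and the routing of each augmented view to a particular encoder is a guess, since the abstract does not spell it out. All function names are hypothetical and are not taken from the authors' repository.

```python
import random

# Toy sample: a pair (desired, context), e.g. (voice, spoken content).
# Real inputs would be audio signals; this structure is an assumption.

def augment_keep_desired(sample, rng):
    """Manipulate the context while preserving the desired property."""
    desired, _ = sample
    return (desired, rng.random())

def augment_keep_context(sample, rng):
    """Manipulate the desired property while preserving the context."""
    _, context = sample
    return (rng.random(), context)

def desired_encoder(view):
    # Stand-in for a trained encoder: it can only rely on the part of
    # its input that the augmentation left intact (assumed behavior).
    return view[0]

def context_encoder(view):
    # Stand-in for the second encoder, reading the preserved context.
    return view[1]

def decoder(z_desired, z_context):
    # Stand-in decoder: rebuilds the sample from the two embeddings.
    return (z_desired, z_context)

def reconstruction_loss(sample, reconstruction):
    # Squared error between the original sample and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(sample, reconstruction))

def forward(sample, rng):
    view_d = augment_keep_desired(sample, rng)  # context is scrambled
    view_c = augment_keep_context(sample, rng)  # desired is scrambled
    return decoder(desired_encoder(view_d), context_encoder(view_c))

rng = random.Random(0)
sample = (0.25, 0.75)
recon = forward(sample, rng)
loss = reconstruction_loss(sample, recon)
```

In this toy setting the original sample can only be rebuilt if each encoder keeps exactly the property its augmentation preserved, which is the intuition behind the decomposition claim. A real model would learn both encoders and the decoder by minimizing a reconstruction loss of this kind.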

Related papers