Towards Robust Unsupervised Disentanglement of Sequential Data — A Case Study Using Music Audio
Yin-Jyun Luo1∗, Sebastian Ewert2 and Simon Dixon1
1Centre for Digital Music, Queen Mary University of London
Abstract
Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describe an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables, and is prone to collapse the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions such as adversarial training, which usually involves laborious parameter tuning, and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.1
1 Introduction
From a probabilistic point of view, representation learning involves a data generating process governed by multiple explanatory factors of variation [Bengio, 2013].
The goal of learning a disentangled representation is to extract the underlying factors such that perturbations of one factor only change certain attributes of the observation.
For example, one can disentangle object identity from its motion in video [Denton and Birodkar, 2017], separate sentiment from content in natural language [Fu et al., 2017], model style and linguistic information independently in speech [Hsu et al., 2017], and learn distinct representations for genre in music [Brunner et al., 2018].
∗Contact Author
1The implementation and audio samples are accessible from https://github.com/yjlolo/dSEQ-VAE.
Figure 1: System diagrams of Two-Stage DSAE. Left: the constrained training stage where the local modules are frozen. Right: the stage of informed-prior training where the global latent is regularised by the associated posterior learnt from the first stage. The dashed arrows denote broadcast along the time-axis.
In this sense, disentangled representation promotes model interpretability by exposing semantically meaningful features, and enables controllable data generation by feature manipulation.
However, as shown by Locatello et al. [2019], disentanglement can only be achieved with either supervision or inductive biases – and hence any unsupervised system for learning disentangled representations has to involve the latter.
For sequential data, a common inductive bias assumes that the observation is generated by a static (global) latent variable associated with the entire sequence, and a series of dynamic (local) latent variables varying over time [Hsu et al., 2017; Li and Mandt, 2018; Khurana et al., 2019; Zhu et al., 2020; Vowels et al., 2021; Han et al., 2021; Bai et al., 2021].
The disentangled sequential autoencoder (DSAE) [Li and Mandt, 2018] is a minimalistic framework that implements the concept above using a probabilistic graphical model, as illustrated in Fig 2.
However, as we show in Section 6, DSAE does not robustly achieve disentanglement but heavily relies on a problem-specific architecture design and parameter tuning.
Several works have built upon DSAE, extending it with either self-supervised learning techniques based on domain-specific data augmentation [Bai et al., 2021], alternative distance measures for the distributions involved which require extensive hyperparameter tuning or estimations susceptible to the instability resulting from adversarial training [Han et al., 2021], or a rather complex parameterisation of a computationally heavy generative model [Vowels et al., 2021].
In order to improve the robustness of DSAE, we propose TS-DSAE, a simple yet effective framework encompassing a two-stage training method as well as explicit regularisation to improve factor invariance and manifestation.
The framework is completely unsupervised and free from any form of data augmentation or adversarial training (but could be combined with either in the future).
We use an artificial as well as a real-world music audio dataset to verify the effectiveness of the proposed framework over a wide range of configurations, and provide both quantitative and qualitative evaluations.
While the baseline models suffer from the collapse of the global latent space, TS-DSAE consistently provides reliable disentanglement (as measured by a classification metric), improves reconstruction quality with increased network capacity without compromising disentanglement, and is able to accommodate multiple global factors shared in the same latent space.
2 Disentangled Sequential Autoencoders
DSAEs [Li and Mandt, 2018; Zhu et al., 2020; Bai et al., 2021; Han et al., 2021; Vowels et al., 2021] are a family of probabilistic graphical models representing a joint distribution

pθ(x1:T, z1:T, v) = pθ(v) ∏t pθ(zt|z<t) pθ(xt|zt, v),    (1)

where x1:T denotes the observed sequence with T time frames, z1:T is the sequence of local latent variables, and v refers to the global latent variable. In practice, pθ(zt|z<t) is a Gaussian parameterised by recurrent neural networks (RNNs), and pθ(xt|zt, v) is implemented using fully-connected networks (FCNs). The model thus reserves separate latent variables z1:T and v for the local and global factors, respectively, imposing an inductive bias for unsupervised disentanglement, which is otherwise impossible [Locatello et al., 2019].
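Read generatively, the factorisation above corresponds to simple ancestral sampling. The PyTorch sketch below assumes hypothetical modules prior_v, transition and decoder that each return a torch.distributions object, and that transition maps None to the prior over the first frame; it illustrates the sampling order only, not the architecture used in the paper.

```python
import torch

def generate(prior_v, transition, decoder, T: int) -> torch.Tensor:
    """Ancestral sampling from the DSAE joint:
    v ~ p(v); z_t ~ p(z_t | z_<t); x_t ~ p(x_t | z_t, v)."""
    v = prior_v.sample()                   # one global latent for the whole sequence
    z_prev, frames = None, []
    for _ in range(T):
        z_t = transition(z_prev).sample()  # dynamic (local) latent at frame t
        x_t = decoder(z_t, v).sample()     # frame generated from (z_t, v)
        frames.append(x_t)
        z_prev = z_t
    return torch.stack(frames, dim=0)      # generated sequence x_{1:T}
```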
Figure 2: The two models proposed in the original DSAE. The red arrows highlight the enriched inference networks qφ(·).
We investigate the two configurations illustrated in Fig 2. “full q” follows the inference networks written in Eq (2), and qφ(zt|x1:T, v) can be implemented via RNNs; “factorised q” instead assumes qφ(z1:T|x1:T) = ∏t qφ(zt|xt) with an FCN shared across the time-axis, which is independent of v. In both cases, qφ(v|x1:T) can be parameterised by either RNNs or FCNs. We will use “factorised q” for the exposition in Section 3.
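Eq (2) itself is not reproduced above; a standard DSAE evidence lower bound that is consistent with the reconstruction term and the two KL terms referred to in this section, written here as an assumed form rather than a verbatim restatement, is

```latex
\mathcal{L}(\theta,\phi)
  = \mathbb{E}_{q_\phi(z_{1:T}, v \mid x_{1:T})}\!\left[\log p_\theta(x_{1:T}\mid z_{1:T}, v)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z_{1:T}\mid x_{1:T}, v)\,\|\,p_\theta(z_{1:T})\right)
  - D_{\mathrm{KL}}\!\left(q_\phi(v\mid x_{1:T})\,\|\,p(v)\right),
```

with the KL term on the global latent v appearing last, which matches how Eq (2) is referred to in Section 3.1.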
A major challenge is that optimising Eq (2) does not prevent the local latent z1:T from capturing all the necessary information for reconstructing the given input sequence x1:T .
This is referred to as the “shortcut problem” [Lezama, 2019], where the model completely ignores some latent variables (the global in this case) and only utilises the rest.
3 Method
We propose TS-DSAE, which constitutes a two-stage training framework and explicitly imposes regularisation for factor invariance as well as factor rendering in order to encourage disentanglement, as illustrated in Fig 1, which depicts the simplified inference network (factorised q) to avoid clutter.
3.1 Two-Stage Training Framework
The shortcut problem mentioned in Section 2 can be ascribed to the simplicity of the uni-modal prior p(v), which is not expressive enough to capture the multi-modal global factors, i.e., qφ(v|x1:T) is over-regularised.
The issue is further exaggerated by the relatively capacity-rich local latents z1:T, which are allowed to carry information at the frame resolution identical to x1:T.
Constrained training: During constrained training, we freeze some parameters of the local module after initialization including the local encoder and the transition network.
This way, the local latents zt resemble random projections from the input and thus are not optimised to hold the most important information to encode the input.
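A minimal PyTorch sketch of the constrained stage is given below; the module names (local_encoder, transition, global_encoder, decoder) are placeholders rather than the names used in the released implementation, and only the freezing logic is shown.

```python
import torch

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def enter_constrained_stage(model) -> None:
    """First stage: freeze the local module (local encoder and transition
    network) right after initialisation, so that z_t acts as a random
    projection and the global encoder is pushed to absorb the
    sequence-level information."""
    set_requires_grad(model.local_encoder, False)
    set_requires_grad(model.transition, False)
    set_requires_grad(model.global_encoder, True)
    set_requires_grad(model.decoder, True)

def enter_informed_prior_stage(model) -> None:
    """Second stage: unfreeze everything and continue training."""
    for m in (model.local_encoder, model.transition,
              model.global_encoder, model.decoder):
        set_requires_grad(m, True)
```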
Informed-prior training: The uni-modal prior p(v), however, poses a great challenge to learning an informative latent space, evidenced by our results in Section 6.
In particular, instead of setting the global prior to N(0, 1) as in constrained training, we set

p(vi) = qφC(vi|xi 1:T),    (3)

where φC denotes the parameters of the global encoder at the C-th epoch. That is, we have for each input sequence i a corresponding sequence-level prior that has been learnt from constrained training, whereby the last KL term in Eq (2) is replaced by DKL(qφ(vi|xi 1:T) ‖ qφC(vi|xi 1:T)). Note that we differentiate qφ from qφC to emphasise that we take a “snapshot” of the global encoder qφC(·) at the C-th epoch, use the network to parameterise the sequence-specific prior, and continue training the global encoder qφ(·) which is initialised by φC.
In other words, we keep training the posterior but “anchor” the distribution of each sequence i to its associated prior which is the posterior obtained from constrained training and is supposed to capture the sequence-level global factors.
This way, although the local module is introduced over the training, the global latent variables of sequences no longer commonly share the uni-modal prior, thereby mitigating the effect of over-regularisation.
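One way to realise the sequence-level prior is to copy the global encoder at epoch C, keep the copy frozen, and penalise the divergence between the current posterior and this snapshot. The sketch below assumes diagonal-Gaussian posteriors and a hypothetical encoder that returns a torch.distributions.Normal; it covers only the replaced global KL term, not the rest of the objective.

```python
import copy
import torch
from torch.distributions import Normal, kl_divergence

def snapshot_global_encoder(global_encoder: torch.nn.Module) -> torch.nn.Module:
    """Take a frozen copy of the global encoder q_{phi_C} at epoch C."""
    prior_encoder = copy.deepcopy(global_encoder)
    for p in prior_encoder.parameters():
        p.requires_grad = False
    return prior_encoder

def informed_prior_kl(global_encoder, prior_encoder, x) -> torch.Tensor:
    """KL(q_phi(v|x) || q_{phi_C}(v|x)): anchor each sequence's posterior
    to the sequence-level prior obtained from constrained training."""
    q_v: Normal = global_encoder(x)        # current posterior over v
    with torch.no_grad():
        p_v: Normal = prior_encoder(x)     # frozen sequence-specific prior
    return kl_divergence(q_v, p_v).sum(-1).mean()
```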
In the next section, we further propose four additional loss terms to encourage disentanglement of the global and local latent variables.
3.2 Factor Invariance and Manifestation
Consider the following scheme of inference, replacement, decoding, and inference: given the inferred variables zi 1:T ∼ qφ(z1:T|xi 1:T) and vi ∼ qφ(v|xi 1:T), we can replace vi with vj inferred from another sequence j, decode the resulting sequence xvi→vj 1:T, and infer zvi→vj 1:T and vvi→vj from qφ(z1:T|xvi→vj 1:T) and qφ(v|xvi→vj 1:T), respectively. If z1:T and v have been successfully disentangled, the difference between zvi→vj 1:T and zi 1:T would be minimal, because replacing the global factor should not affect the subsequently inferred local factor; and vvi→vj should be close to vj in order to faithfully manifest the swapping. Similarly, if we replace z1:T instead, the difference between vzi 1:T→zj 1:T and vi is expected to be small, and zzi 1:T→zj 1:T should be close to zj 1:T.
We can impose the desired properties of factor invariance as well as the rendering of the target factors by introducing four additional divergence terms, Eqs (4)-(7), to Eq (2). In practice, we pair each input sequence i in a mini-batch with a randomly sampled input sequence j from the same mini-batch, and perform the above-mentioned scheme of inference, replacement, decoding, and inference to compute these terms.
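A schematic version of this swap-and-re-infer procedure is sketched below. The encoders and decoder are placeholders assumed to return and consume posterior means, and the squared error between posterior statistics stands in for the divergences of Eqs (4)-(7), whose exact form and weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def swap_losses(enc_local, enc_global, dec, x_i, x_j):
    """Factor-invariance and manifestation terms for a pair (x_i, x_j)."""
    z_i, v_i = enc_local(x_i), enc_global(x_i)
    z_j, v_j = enc_local(x_j), enc_global(x_j)

    # Swap the global factor: decode with (z_i, v_j), then re-infer.
    x_gswap = dec(z_i, v_j)
    z_gswap, v_gswap = enc_local(x_gswap), enc_global(x_gswap)

    # Swap the local factor: decode with (z_j, v_i), then re-infer.
    x_lswap = dec(z_j, v_i)
    z_lswap, v_lswap = enc_local(x_lswap), enc_global(x_lswap)

    return (
        F.mse_loss(z_gswap, z_i)    # local invariance under a global swap
        + F.mse_loss(v_gswap, v_j)  # swapped global factor is manifested
        + F.mse_loss(v_lswap, v_i)  # global invariance under a local swap
        + F.mse_loss(z_lswap, z_j)  # swapped local factor is manifested
    )
```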
However, we found scaling them unnecessary for the success of disentanglement, and leave this study for future work.
To summarise, TS-DSAE constitutes a two-stage training framework that facilitates the exploitation of additional divergences to achieve robust unsupervised disentanglement, which we empirically verify in Section 6.
4 Related Work
The assumption of a sequence being generated by a stationary global factor and a temporally changing local factor to achieve unsupervised disentanglement has been used before.
FHVAE [Hsu et al., 2017] constructs a hierarchical prior where each input is governed by a sequence-level prior on top of a segment-level prior.
Our two-stage training framework shares the spirit, with the main difference being that we leverage the strong bottleneck during the constrained training to naturally promote a global information-rich posterior which can be directly used as the sequence-level prior for the complete model training stage.
On the other hand, FHVAE initialises and learns the prior from scratch, which lacks such a strong inductive bias, and a discriminative objective function is reported to be helpful.
Also, learning of the sequence-level priors is amortised by the global encoder in our model, whereby memory consumption does not scale with the number of training data as in FHVAE.
The vanilla DSAE [Li and Mandt, 2018] is proposed as an elegant minimalistic model to achieve disentanglement, as shown in Fig 2.
However, we demonstrate its tendency to collapse the global latent space in Section 6, which is likely due to the over-simplified standard Gaussian prior.
R-WAE [Han et al., 2021] instead minimises the Wasserstein distance between the aggregated posterior and the prior, estimated by maximum mean discrepancy or generative adversarial networks, neither of which is trivial in terms of parameter tuning and optimisation.
S3-VAE [Zhu et al., 2020] and C-DSVAE [Bai et al., 2021] exploit self-supervised learning and employ either domain-specific ad-hoc loss functions or data augmentation.
The proposed TS-DSAE is free from any form of supervision, adversarial training, or domain-dependent data augmentation.
VDSM adopts a pre-training stage as well as a scheme of KL-annealing to promote usage of the global latent space [Vowels et al., 2021], which is similar to our constrained training.
The main differences, however, are that we train only the global variable during “pre-training”, and avoid KL-annealing to save the tuning efforts.
Further, VDSM employs n decoders, each of which is responsible for a unique identity of a video object, where n is set manually depending on the dataset.
Li et al. [2020] propose to progressively learn disentangled hierarchical representations of data. The framework first trains a network with a low capacity latent space in order to learn the factors of interest, and subsequently increases the latent space capacity to improve data reconstruction.
Our constrained training stage is also reminiscent of multiview representation learning.
For example, VCCA [Wang et al., 2016] formulates a model that samples different views of a common object from distributions conditioned on a shared latent variable.
NestedVAE [Vowels et al., 2020] learns the common factors using staged information bottlenecks by training a low-level VAE given the latent space derived from a high-level VAE.
In our model, given an input sequence, we treat multiple time frames as the different “views” of a common underlying factor which is the global factor.
There has been a lack of exploration in unsupervised disentangled representation for music audio.
Both Luo et al. [2020] and Cífka et al. [2021] exploit self-supervised learning to decorrelate instrument pitch and timbre.
Similar to our work, the latter models monophonic melodies.
Yet, it employs pitch-shifting which is domain-dependent, and constrains the local capacity by learning discrete latent variables which might pose optimisation challenges.
5 Experimental Setup
We evaluate the proposed framework on an artificial dataset and a real-world music audio dataset. The former facilitates the control over the underlying factors of variation, while the latter demonstrates applicability of the proposed model to realistic data.
dMelodies: The artificial dataset is compiled by synthesising audio from monophonic symbolic music gathered from dMelodies [Pati et al., 2020].
Each melody is a two-bar sequence with 16 eighth notes, subject to several global factors, i.e., tonic, scale, and octave, and local factors, i.e., direction of arpeggiation, and rhythm.
We also discard melodies starting or ending with the rest note to avoid spurious amplitude values and boundaries during audio synthesis with FluidSynth.2
We randomly pick 3k samples from the remaining melodies which are then split into 80% training and 20% validation sets, and synthesise audio of sampling rate 16kHz using sound fonts of violin and trumpet from MuseScore General.sf3.3 The amplitude of each audio sample is normalised with respect to its maximum value.
The number of samples rendered with the two instruments is uniformly distributed.
URMP: For the real-world audio recordings, we select the violin and trumpet tracks from the URMP dataset [Li et al., 2019].
We follow the preprocessing by Hayes et al. [2021], where the amplitude of each audio recording, resampled to 16kHz, is normalised in a corpus-wide fashion for each instrument subset.
The audio samples are then divided into four-second segments, and segments with mean pitch confidence lower than 0.85 are discarded, as assessed by the full CREPE model [Kim et al., 2018], a state-of-the-art pitch extractor.
The process results in 1,545 violin and 534 trumpet samples in the training set, and 193 violin and 67 trumpet samples for validation.
We transform the audio samples and represent the data as log-amplitude mel-spectrogram with 80 mel filter banks, derived from a short-time Fourier transform with a 128ms Hann window and 16ms hop, leading to x1:T ∈ R80×251.
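A torchaudio sketch of this front end, matching the stated 80 mel bands, 128 ms Hann window and 16 ms hop at a 16 kHz sampling rate, is shown below; the additive floor applied before the logarithm is an assumption.

```python
import torch
import torchaudio

SR = 16000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR,
    n_fft=int(0.128 * SR),       # 128 ms Hann window -> 2048 samples
    hop_length=int(0.016 * SR),  # 16 ms hop -> 256 samples
    n_mels=80,
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) mono audio at 16 kHz.
    Returns a log-amplitude mel-spectrogram of shape (80, num_frames);
    a four-second segment yields roughly 251 frames."""
    spec = mel(waveform)                      # (1, 80, num_frames)
    return torch.log(spec + 1e-5).squeeze(0)  # floor value is an assumption
```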
We use net-[layers] to denote architectures of modules, where net indicates the type of the network, and [layers] is a list specifying the numbers of neurons at each layer. If a Gaussian parameterisation layer follows, we append the notation Gau-L, which encompasses two linear layers with parameters w1 and w2 projecting the output hidden states h to µw1(h) ∈ RL and log σ2w2(h) ∈ RL, respectively, where the Gaussian variable lives in an L-dimensional space. We parameterise the global encoder qφ(v|x1:T) as FCN-[64,64]-Avg-Gau-16, where Avg denotes average pooling across the time-axis, and we keep the size of v fixed as 16 across our main experiments; and the local encoder qφ(z1:T|x1:T) as FCN-[64,64]-Gau-{8,16,32}, where we investigate different sizes of z1:T. The decoder pθ(xt|zt, v) is FCN-[64,64]-Gau-80, taking as input the concatenation of z1:T and the time-axis broadcast v. Note that, following the convention of VAEs, the Gaussian layer of the decoder parameterises N(µw1(h), 1), which evaluates the likelihood pθ(xt|zt, v) as the squared L2-norm between the output of the decoder and xt. For full q, the local encoder qφ(z1:T|x1:T, v) is factorised to biRNN-[64,64]-Gau-{8,16,32}, which takes as input the concatenation of x1:T and the time-axis broadcast v inferred from qφ(v|x1:T).
biRNN denotes a bi-LSTM, where the outputs of the forward and backward LSTM are averaged along the time-axis before the Gaussian layer.
Both the transition network and decoder follow those of factorised q.
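Under this notation, the global encoder FCN-[64,64]-Avg-Gau-16 could be sketched as follows; the layer sizes follow the text, while the choice of activation function is an assumption.

```python
import torch
import torch.nn as nn

class GaussianLayer(nn.Module):
    """Gau-L: two linear heads mapping hidden states h to the mean and
    log-variance of an L-dimensional diagonal Gaussian."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, h):
        return self.mu(h), self.logvar(h)

class GlobalEncoder(nn.Module):
    """FCN-[64,64]-Avg-Gau-16: frame-wise FCN, average pooling over time,
    then a Gaussian layer for the 16-dimensional global latent v."""
    def __init__(self, input_dim: int = 80, latent_dim: int = 16):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),  # ReLU is an assumption
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.gauss = GaussianLayer(64, latent_dim)

    def forward(self, x):           # x: (batch, T, 80)
        h = self.fcn(x)             # (batch, T, 64)
        h = h.mean(dim=1)           # Avg: pool across the time-axis
        return self.gauss(h)        # mu, logvar of q(v | x_{1:T})
```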
Optimisation: Our implementation is based on PyTorch v1.9.0 and we use ADAM [Kingma and Ba, 2015] with default parameters lr = 0.001, and [β1, β2] = [0.9, 0.999] without weight decay.
We use a batch size of 128, and train the models for 4k epochs at most; we employ early stopping if Eq (2) obtained from the validation set stops improving for 300 epochs.
For the models adopting the proposed two-stage training frameworks presented in Section 3, we set the number of epochs for the first stage C = 300 for all cases, to which we find the performance insensitive.
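The optimisation recipe can be assembled as in the following sketch, where model, train_one_epoch, evaluate_elbo and val_loader are hypothetical placeholders and the early-stopping bookkeeping is a simplified reading of the stated criterion.

```python
import torch

# model, train_one_epoch, evaluate_elbo and val_loader are placeholders
# for the actual training code.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=0.0)

C, patience = 300, 300
best_elbo, epochs_since_best = float("-inf"), 0
for epoch in range(4000):
    train_one_epoch(model, optimiser, constrained=(epoch < C))
    val_elbo = evaluate_elbo(model, val_loader)   # Eq (2) on the validation set
    if val_elbo > best_elbo:
        best_elbo, epochs_since_best = val_elbo, 0
    else:
        epochs_since_best += 1
    if epochs_since_best >= patience:             # early stopping
        break
```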
We do not include the models mentioned in Section 4 [Zhu et al., 2020; Bai et al., 2021; Han et al., 2021], which are left for future work, because the main focus is to improve upon DSAE with minimum modifications, and thus provide a superior backbone model which can be complementary with the existing methods.
6 Results
6.1 Instrument Classification
We first evaluate disentanglement through the lens of instrument classification.
In particular, we train a linear discriminant analysis (LDA) classifier taking as inputs v ∼ qφ(v|x1:T ), the global latent variables sampled from a learnt model, derived from the training set, and evaluate its classification accuracy for instrument identity in terms of the macro F1-score on the validation set.
We pair each sequence i from the validation set with another sequence j recorded with the other instrument, and perform the scheme of inference, replacement, decoding, and inference.
Following the notation in Section 3.2, vvi→vj should be predictive of the instrument of sample j, while vzi 1:T→zj 1:T should reflect the original instrument of sample i.
We report three metrics including accuracy before the replacement (pre-swap), after replacing v (post-global swap), and after replacing z1:T (post-local swap).
Note that we use the mean parameters of the Gaussian posterior qφ(v|x1:T ) to train the LDA.
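A scikit-learn sketch of this probe is given below; v_train, v_val, v_post_global, v_post_local and y_swap_target are illustrative variable names standing for the posterior means of v extracted before and after the swaps and for the corresponding instrument labels.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score

# v_*: mean parameters of q(v | x_{1:T}); y_*: instrument labels.
lda = LinearDiscriminantAnalysis().fit(v_train, y_train)

pre_swap = f1_score(y_val, lda.predict(v_val), average="macro")
# v inferred after replacing the global latent should predict the target instrument,
post_global_swap = f1_score(y_swap_target, lda.predict(v_post_global), average="macro")
# while v inferred after replacing the local latents should keep the original one.
post_local_swap = f1_score(y_val, lda.predict(v_post_local), average="macro")
```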
The results are summarised in Fig 3.
The proposed TS-DSAE (red), with either factorised or full q, is consistently located at the top right corner of the plot, across all the sizes of the local latent space.
This indicates its robust disentanglement as well as a linearly separable global latent space.
Figure 4: FAD (the lower the better) of reconstruction versus macro F1 score for instrument classification, evaluated using URMP. See Section 6.2 for details.

From the left to right column, the competing methods DSAE-f (cyan) and TS-DSAE without the additional regularisations (orange) move from the top right to the left-hand side of the plot, showing the inclination for a collapsed global latent space with the increased local latent capacity.
Being located at lower left of the plot, DSAE (gray) attains the worst performance in most configurations.
This highlights the issue of positing the standard Gaussian prior in the global latent space.
The overall high pre-swap and low post-swap F1 especially towards high-dimensional zt implies that the decoder tends to ignore v, even though the mean parameter of qφ(v|x1:T ) is discriminative w.r.t. the instrument identity.
The competing models appear to suffer the most from the size of zt as a large local latent space can easily capture all the necessary information for reconstruction.
6.2 Reconstruction Quality
We examine the trade-off between disentanglement and reconstruction in terms of Fréchet Audio Distance (FAD) [Kilgour et al., 2019], which is reported to correlate with auditory perception.
We only report the results for URMP in Fig 4, as both datasets lead to a similar conclusion.
As expected, FAD is improved with increasing zt dimension.
However, TS-DSAE is the only model that overcomes the trade-off, in the sense that competing models lose their ability to disentangle (move from right to left of the plot) with the improved FAD.
6.3 Raw Pitch Accuracy
In this section, we evaluate z1:T by applying the full CREPE model [Kim et al., 2018] to audio re-synthesised from the mel-spectrogram.
The conversion is done by InverseMelScale and GriffinLim accessible from torchaudio v0.9.0.
Using the notation from Section 3.2, we extract pitch contours from the reconstructed samples (pre-swap), from xvi→vj 1:T (post-global swap), which is supposed to mirror the pitch contour of xi 1:T, and from xzi 1:T→zj 1:T (post-local swap), which is supposed to follow the pitch contour of xj 1:T. Note that for models with a trivial v, the accuracy of the post-global swap will remain high, as the decoder is independent of v. We extract pitch contours from the input data as the ground truth and report the raw pitch accuracy (RPA) with a 50-cent threshold [Salamon et al., 2014].
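A sketch of the inversion-and-scoring pipeline is given below, using the torchaudio transforms named above and the crepe package; the STFT and mel settings mirror Section 5, and computing RPA directly from the two CREPE pitch tracks is a simplification of the evaluation protocol.

```python
import numpy as np
import torch
import torchaudio
import crepe

inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1025, n_mels=80, sample_rate=16000)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=2048, hop_length=256)

def resynthesise(log_mel: torch.Tensor) -> np.ndarray:
    """Invert a log-amplitude mel-spectrogram (80, T) back to a waveform."""
    mel_spec = torch.exp(log_mel)       # undo the log; floor handling omitted
    linear_spec = inv_mel(mel_spec)     # (1025, T) magnitude spectrogram
    return griffin_lim(linear_spec).numpy()

def raw_pitch_accuracy(ref_audio: np.ndarray, est_audio: np.ndarray, sr: int = 16000) -> float:
    """RPA: fraction of frames whose estimated pitch lies within 50 cents
    of the reference pitch, both extracted with the full CREPE model."""
    _, ref_hz, _, _ = crepe.predict(ref_audio, sr, model_capacity="full", viterbi=True)
    _, est_hz, _, _ = crepe.predict(est_audio, sr, model_capacity="full", viterbi=True)
    n = min(len(ref_hz), len(est_hz))
    cents = 1200.0 * np.abs(np.log2(est_hz[:n] / ref_hz[:n]))
    return float(np.mean(cents <= 50.0))
```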
Figure 5: RPA assessed using CREPE on URMP.
We report the results with URMP in Fig 5.
TS-DSAE consistently improves with the increasing size of zt in terms of RPA.
Except for the post-local swap, TS-DSAE performs comparably with the competing models towards the larger zt, while achieving disentanglement at the same time.
6.4 Richer Decoders
To mitigate the trade-off, we further construct and evaluate a richer decoder where the reconstruction of xt is conditioned on z1:T, i.e., the entire sequence of local latent variables, instead of zt.
We set the size of zt to 16, and the inference network to factorised q, and compare DSAE, TS-DSAE, and the TS-DSAE augmented with the enriched decoder.
As shown in Fig 6, the enriched model maintains the perfect accuracy for instrument classification for both datasets, while improving FAD over its counterpart with the factorised decoder.
We leave the evaluation for the full range of configurations for future work, including autoregressive decoders that could cause posterior collapse even for vanilla VAEs.
6.5 Multiple Global Factors
We now consider both the fourth and fifth octaves when synthesising the dMelodies dataset, introducing octave number as the other global factor of variation in addition to instrument identity.
We train the decoder-enriched TS-DSAE described in Section 6.4, and show the results in Fig 7.
In particular, we replace v inferred from the source at the lower left, with that derived from one of the three targets displayed in the top row, and generate novel samples shown from the second to last columns of the bottom row.
We use {i, o, m} to denote the instrument, octave, and melody of each sample, respectively.
For example, the source {i1, o1, m1} and the first target {i2, o1, m2} share the same octave but differ in the instrument, characterised by the spectral distribution along the frequency axis.
As a result of replacing v, the target instrument i2 is manifested in the outcome {i1→2, o1, m1}, while the octave remains unchanged.
Similarly, the second target {i1, o2, m3} differs from the source in the octave, characterised by the level of the pitch contour; therefore, swapping v only transforms the octave for the output {i1, o1→2, m1}.
Finally, the sample {i1→2, o1→2, m1} results from using the target {i2, o2, m4} that does not share any of the attributes with the source, where both the instrument and octave are converted.
7 Conclusion
We have proposed TS-DSAE, a robust framework for unsupervised sequential data disentanglement, which has been shown to consistently work over a wide range of settings.
Our evaluation focuses on the ability to robustly achieve disentanglement, and we leave evaluations on multi-modal data generation from unconditional prior sampling for future work.
Scaling the regularisation terms differently might be helpful as mentioned in Section 3.2.
Moreover, DSAE's probabilistic graphical model forces the input sequence to have a single global latent variable fixed over time, which could be too restrictive for more general use cases where sequences do not have stationary factors but ones that evolve slowly over time.
Therefore, adopting a hierarchy of latent variables encoding information at low to high frame rates [Saxena et al., 2021] can be a favorable relaxation of DSAEs.
A potential extension of our two-stage training is to have multiple stages of constrained training with progressively larger network capacity, thereby accommodating the said hierarchy, which can also be seen as a temporal extension of Li et al. [2020].
Acknowledgments
The first author is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported by a scholarship from Spotify.
References
Junwen Bai, Weiran Wang, and Carla Gomes. Contrastively disentangled sequential variational autoencoder. In Advances in Neural Information Processing Systems, 2021.
Zhiyuan Li, Jaideep Vitthal Murkute, Prashnna Kumar Gyawali, and Linwei Wang. Progressive learning and disentanglement of hierarchical representations. ArXiv, 2020.
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the International Conference on Machine Learning, 2019.
Yin-Jyun Luo, Kin Wai Cheuk, Tomoyasu Nakano, Masataka Goto, and Dorien Herremans. Unsupervised disentanglement of pitch and timbre for isolated musical instrument sounds. In Proceedings of the International Society for Music Information Retrieval, 2020.
Ashis Pati, Siddharth Gururani, and Alexander Lerch. dMelodies: A music dataset for disentanglement learning. In Proceedings of the International Society for Music Information Retrieval, 2020.
Justin Salamon, Emilia Gómez, Daniel P. W. Ellis, and Gaël Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 2014.
Vaibhav Saxena, Jimmy Ba, and Danijar Hafner. Clockwork variational autoencoders. ArXiv, 2021.
Matthew James Vowels, Necati Cihan Camgöz, and Richard Bowden.