Opening the Black Box of wav2vec Feature Encoder
- URL: http://arxiv.org/abs/2210.15386v1
- Date: Thu, 27 Oct 2022 12:47:35 GMT
- Title: Opening the Black Box of wav2vec Feature Encoder
- Authors: Kwanghee Choi, Eun Jung Yeo
- Abstract summary: We focus on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units.
To analyze the embedding space in a reductive manner, we feed synthesized audio signals, each constructed as a sum of simple sine waves.
We conclude that the feature encoder representations embed several kinds of information: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail.
- Score: 2.1219431687928525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised models, namely wav2vec and its variants, have shown
promising results in various downstream tasks in the speech domain. However,
their inner workings are poorly understood, calling for in-depth analyses of
what the model learns. In this paper, we concentrate on the convolutional
feature encoder, whose latent space is often speculated to represent discrete
acoustic units. To analyze the embedding space in a reductive manner, we feed
synthesized audio signals, each constructed as a sum of simple sine waves.
Through extensive experiments, we conclude that the feature encoder
representations embed several kinds of information: (1) fundamental frequency,
(2) formants, and (3) amplitude, packed with (4) sufficient temporal detail.
Further, the information captured in the latent representations is analogous
to spectrograms, but with a fundamental difference: the latent representations
form a metric space, so that closer representations imply acoustic similarity.
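As a rough illustration of the probing setup described above, the sketch below synthesizes sums of simple sine waves and runs them through the convolutional feature encoder of a pretrained wav2vec 2.0 model. It is a minimal sketch under stated assumptions, not the authors' protocol: the checkpoint name (facebook/wav2vec2-base), the use of the Hugging Face transformers `feature_extractor` attribute to reach the CNN stack, the probe frequencies, and the frame-averaged Euclidean distance are all illustrative choices.

```python
# Minimal sketch (not the paper's exact setup): probe the wav2vec 2.0
# convolutional feature encoder with synthetic sums of sine waves and
# compare distances between the resulting latent representations.
import numpy as np
import torch
from transformers import Wav2Vec2Model

SR = 16_000  # wav2vec 2.0 expects 16 kHz audio

def sine_mixture(freqs_hz, amps, dur_s=1.0, sr=SR):
    """Synthesize a sum of simple sine waves (hypothetical probe signal)."""
    t = np.arange(int(dur_s * sr)) / sr
    sig = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs_hz, amps))
    sig = sig / np.max(np.abs(sig))                     # peak-normalize
    return torch.from_numpy(sig).float().unsqueeze(0)   # shape: (1, samples)

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

with torch.no_grad():
    # In the Hugging Face implementation, model.feature_extractor is the
    # stack of 1-D convolutions; its output has shape (batch, 512, frames),
    # roughly one frame per 20 ms. A real probe would also apply the
    # model's expected input preprocessing (e.g. normalization).
    z_a = model.feature_extractor(sine_mixture([220, 660], [1.0, 0.5]))
    z_b = model.feature_extractor(sine_mixture([230, 690], [1.0, 0.5]))
    z_c = model.feature_extractor(sine_mixture([440, 1320], [1.0, 0.5]))

    # Frame-averaged Euclidean distances between latent representations.
    d_ab = torch.dist(z_a.mean(-1), z_b.mean(-1)).item()
    d_ac = torch.dist(z_a.mean(-1), z_c.mean(-1)).item()
    print(f"d(a,b)={d_ab:.3f}  d(a,c)={d_ac:.3f}")
```

If the latent space indeed behaves as a metric space reflecting acoustic similarity, the two mixtures with nearby fundamentals should end up closer to each other than to the mixture pitched an octave higher.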
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation [44.940531391847]
We address the challenge of dense indoor prediction with sound in 2D and 3D via cross-modal knowledge distillation.
We are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations.
For audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-20T06:07:04Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- Learning and controlling the source-filter representation of speech with a variational autoencoder [23.05989605017053]
In speech processing, the source-filter model assumes that speech signals are produced from a few independent and physically meaningful continuous latent factors.
We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces.
This yields a deep generative model of speech spectrograms without requiring additional information such as text or human-labeled data.
arXiv Detail & Related papers (2022-04-14T16:13:06Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments containing transient and stationary noise.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- Neural Granular Sound Synthesis [53.828476137089325]
Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows.
We show that generative neural networks can implement granular synthesis while alleviating most of its shortcomings.
arXiv Detail & Related papers (2020-08-04T08:08:00Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.