Related papers: Variable-rate discrete representation learning

Variable-rate discrete representation learning

URL: http://arxiv.org/abs/2103.06089v1
Date: Wed, 10 Mar 2021 14:42:31 GMT
Title: Variable-rate discrete representation learning
Authors: Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan
Abstract summary: We propose slow autoencoders for unsupervised learning of high-level variable-rate discrete representations of sequences. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals. We develop run-length Transformers for event-based representation modelling and use them to construct language models in the speech domain.
Score: 20.81400194698063
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.

Related papers

A Variational Framework for Improving Naturalness in Generative Spoken Language Models [52.673912922590866]
We propose an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance semantic tokens.<n>Our approach eliminates the need for manual extraction and selection of paralinguistic features.<n>It produces preferred speech continuations according to human raters.
arXiv Detail & Related papers (2025-06-17T17:58:17Z)
StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation [33.695308849489784]
We propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs.<n>Specifically, we train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton.<n>We design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features.
arXiv Detail & Related papers (2025-06-16T07:09:51Z)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and SotA inc segmentation and clustering. SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired. We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples. We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings. The approach uses a masked language modelling objective to learn the representations. We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
Learning and controlling the source-filter representation of speech with a variational autoencoder [23.05989605017053]
In speech processing, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors. We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms.
arXiv Detail & Related papers (2022-04-14T16:13:06Z)
Modeling Intensification for Sign Language Generation: A Computational Approach [13.57903290481737]
End-to-end sign language generation models do not accurately represent the prosody in sign language. We aim to improve the prosody in generated sign languages by modeling intensification in a data-driven manner. We find that our efforts in intensification modeling yield better results when evaluated with automatic metrics.
arXiv Detail & Related papers (2022-03-18T01:13:21Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal. In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective. Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once) The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.