Cognitive Coding of Speech
- URL: http://arxiv.org/abs/2110.04241v1
- Date: Fri, 8 Oct 2021 16:49:16 GMT
- Title: Cognitive Coding of Speech
- Authors: Reza Lotfidereshgi and Philippe Gournay
- Abstract summary: We propose an approach for cognitive coding of speech by unsupervised extraction of contextual representations in two hierarchical levels of abstraction.
This decomposition is achieved by a two-stage neural network, with a lower and an upper stage operating at different time scales.
With an application in speech compression in mind, we investigate the effect of dimensionality reduction and low bitrate quantization on the extracted representations.
- Score: 6.396288020763143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an approach for cognitive coding of speech by unsupervised
extraction of contextual representations in two hierarchical levels of
abstraction. Speech attributes such as phoneme identity that last one hundred
milliseconds or less are captured in the lower level of abstraction, while
speech attributes such as speaker identity and emotion that persist up to one
second are captured in the higher level of abstraction. This decomposition is
achieved by a two-stage neural network, with a lower and an upper stage
operating at different time scales. Both stages are trained to predict the
content of the signal in their respective latent spaces. A top-down pathway
between stages further improves the predictive capability of the network. With
an application in speech compression in mind, we investigate the effect of
dimensionality reduction and low bitrate quantization on the extracted
representations. The performance measured on the LibriSpeech and EmoV-DB
datasets reaches, and for some speech attributes even exceeds, that of
state-of-the-art approaches.
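As a rough illustration of the two-stage idea in the abstract, here is a minimal PyTorch sketch. It is not the authors' implementation: the GRU layers, dimensions, and the 8x downsampling between stages are assumptions chosen only to make the time-scale separation and the top-down pathway concrete.

```python
import torch
import torch.nn as nn

class TwoStageCoder(nn.Module):
    def __init__(self, feat_dim=80, low_dim=256, high_dim=256, stride=8):
        super().__init__()
        self.stride = stride  # upper stage runs at a coarser time scale (assumed factor)
        self.low_rnn = nn.GRU(feat_dim, low_dim, batch_first=True)   # lower stage
        self.high_rnn = nn.GRU(low_dim, high_dim, batch_first=True)  # upper stage
        self.top_down = nn.Linear(high_dim, low_dim)   # top-down pathway between stages
        self.low_head = nn.Linear(low_dim, low_dim)    # predicts future low-level latents
        self.high_head = nn.Linear(high_dim, high_dim) # predicts future high-level latents

    def forward(self, feats):  # feats: (batch, T, feat_dim) frame-level features
        z_low, _ = self.low_rnn(feats)                  # short-time-scale context
        z_sub = z_low[:, self.stride - 1::self.stride]  # downsample in time
        z_high, _ = self.high_rnn(z_sub)                # long-time-scale context
        # Broadcast the slow context back to the fast time scale so it can
        # condition the lower stage's predictions (the top-down pathway).
        td = self.top_down(z_high).repeat_interleave(self.stride, dim=1)
        td = td[:, :z_low.size(1)]
        pred_low = self.low_head(z_low + td)   # to be scored against future z_low
        pred_high = self.high_head(z_high)     # to be scored against future z_high
        return z_low, z_high, pred_low, pred_high

# Example: 2 s of 10 ms frames -> 200 low-level steps, 25 high-level steps.
model = TwoStageCoder()
feats = torch.randn(1, 200, 80)
z_low, z_high, pred_low, pred_high = model(feats)
```

In training, each head would be scored against the actual future latents (for example with a contrastive or regression loss), matching the abstract's statement that both stages are trained to predict the signal's content in their respective latent spaces.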
Related papers
- Towards the Next Frontier in Speech Representation Learning Using Disentanglement [34.21745744502759]
We propose a framework for learning disentangled self-supervised representations of speech (termed Learn2Diss), which consists of frame-level and utterance-level encoder modules.
We show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.
arXiv Detail & Related papers (2024-07-02T07:13:35Z)
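The frame-level versus utterance-level split in the Learn2Diss entry above can be pictured with a small sketch. All shapes and the mean-pooling choice are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Keeps the time axis: one representation per frame (semantic content)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):            # (B, T, feat_dim) -> (B, T, hidden)
        return self.rnn(x)[0]

class UtteranceEncoder(nn.Module):
    """Pools the time axis away: one vector per utterance (speaker, style)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):            # (B, T, feat_dim) -> (B, hidden)
        return self.rnn(x)[0].mean(dim=1)

x = torch.randn(2, 300, 80)          # two utterances, 300 frames each
print(FrameEncoder()(x).shape)       # torch.Size([2, 300, 256])
print(UtteranceEncoder()(x).shape)   # torch.Size([2, 256])
```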
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
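The de-duplication step mentioned in the entry above is simple to show concretely; a minimal sketch (the unit IDs are made up):

```python
from itertools import groupby

def deduplicate(units):
    """Collapse consecutive repeated unit IDs, shortening the sequence."""
    return [u for u, _ in groupby(units)]

# Frame-level discrete units often repeat while a sound is held:
print(deduplicate([5, 5, 5, 9, 9, 2, 2, 2, 2]))  # [5, 9, 2]
```

A subword model (e.g. BPE) over the de-duplicated IDs would then merge frequent unit n-grams, compressing the sequence further.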
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
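A rough sketch of the HUC pipeline described above: 1-D convolutions over windowed raw audio produce frame embeddings, which are then clustered into a small set of phoneme-like unit IDs. The layer sizes, cluster count, and the use of scikit-learn k-means are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Assumed front-end: two 1-D conv layers over the raw waveform.
frontend = nn.Sequential(
    nn.Conv1d(1, 128, kernel_size=10, stride=5), nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=8, stride=4), nn.ReLU(),
)

wave = torch.randn(1, 1, 16000)              # 1 second of 16 kHz audio
frames = frontend(wave).squeeze(0).T         # (T, 128) frame embeddings

# Quantize frames into a small inventory of phoneme-like units.
unit_ids = KMeans(n_clusters=50, n_init=10).fit_predict(frames.detach().numpy())
print(unit_ids[:20])                         # one cluster ID per frame
```

In the actual HUC training these cluster IDs would serve as pseudo-labels for the model; here k-means merely stands in for that step.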
- Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis [3.691712391306624]
We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases to capture diverse prosodic representations.
We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted in the previous step.
arXiv Detail & Related papers (2022-11-01T15:17:25Z)
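The two-step procedure in the entry above can be sketched as follows; the network shapes, the phoneme vocabulary, and the MSE objective are assumptions, and the posterior extraction step is only indicated in comments.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Maps phoneme IDs to predicted token-level latents (assumed design)."""
    def __init__(self, n_phones=100, latent_dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_phones, latent_dim)
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, phone_ids):          # (B, N) -> (B, N, latent_dim)
        return self.rnn(self.emb(phone_ids))[0]

# Step 1 (training-time, assumed): a reference encoder extracts phoneme-level
# posterior latents z_post of shape (B, N, 64) from the speech signal.
# Step 2: train the prior to predict those latents from text alone.
prior = PriorNet()
phones = torch.randint(0, 100, (2, 30))
z_prior = prior(phones)                    # usable at inference without speech
z_post = torch.randn(2, 30, 64)            # stand-in for the extracted posteriors
loss = nn.functional.mse_loss(z_prior, z_post)
```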
- ESSumm: Extractive Speech Summarization from Untranscribed Meeting [7.309214379395552]
We propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm.
We leverage the off-the-shelf self-supervised convolutional neural network to extract the deep speech features from raw audio.
Our approach automatically predicts the optimal sequence of speech segments that capture the key information with a target summary length.
arXiv Detail & Related papers (2022-09-14T20:13:15Z)
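A hedged sketch of the extractive selection step described in the ESSumm entry above: given an embedding and a duration per speech segment, greedily keep the segments most similar to the recording's centroid until the target summary length is filled. The centroid-similarity scoring is an assumption; ESSumm's actual objective may differ.

```python
import numpy as np

def select_segments(seg_embs, durations, target_len):
    """Pick segment indices whose total duration fits target_len (seconds)."""
    centroid = seg_embs.mean(axis=0)
    scores = seg_embs @ centroid / (
        np.linalg.norm(seg_embs, axis=1) * np.linalg.norm(centroid) + 1e-8
    )
    chosen, total = [], 0.0
    for i in np.argsort(-scores):            # most representative first
        if total + durations[i] <= target_len:
            chosen.append(i)
            total += durations[i]
    return sorted(chosen)                    # restore temporal order

embs = np.random.randn(10, 64)               # 10 segments, 64-dim features
durs = np.random.uniform(2.0, 8.0, size=10)  # segment durations in seconds
print(select_segments(embs, durs, target_len=15.0))
```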
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
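The mask-and-predict loop in the TranSpeech entry above can be sketched in a few lines. Here `model` is a hypothetical stand-in for the unit decoder, and the linear re-masking schedule is an assumption common to mask-predict decoders, not a confirmed detail of TranSpeech.

```python
import torch

def mask_predict(model, length, mask_id, iters=4):
    """Non-autoregressive decoding: fill all positions, re-mask the least
    confident ones, and repeat for a fixed number of iterations."""
    units = torch.full((1, length), mask_id, dtype=torch.long)
    for t in range(iters):
        logits = model(units)                    # (1, length, vocab_size)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        units = preds
        n_mask = length * (iters - 1 - t) // iters
        if n_mask > 0:
            worst = probs[0].argsort()[:n_mask]  # least-confident positions
            units[0, worst] = mask_id
    return units

# Toy stand-in: random logits over a 100-unit vocabulary.
dummy = lambda u: torch.randn(u.size(0), u.size(1), 100)
print(mask_predict(dummy, length=12, mask_id=99))
```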
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
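One common way to recast overlapped diarization as single-label prediction is to classify each frame over the powerset of speaker subsets; the sketch below illustrates that framing, though it is an assumption that SEND uses exactly this encoding.

```python
from itertools import combinations

speakers = ["A", "B", "C"]
# Every subset of speakers (including silence) becomes one class.
powerset = [frozenset(c) for r in range(len(speakers) + 1)
            for c in combinations(speakers, r)]
label_of = {s: i for i, s in enumerate(powerset)}

frame = frozenset({"A", "B"})        # overlapped frame: A and B both active
print(len(powerset))                 # 8 classes for 3 speakers
print(label_of[frame])               # a single integer label per frame
```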
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.