A comparison of self-supervised speech representations as input features
for unsupervised acoustic word embeddings
- URL: http://arxiv.org/abs/2012.07387v1
- Date: Mon, 14 Dec 2020 10:17:25 GMT
- Authors: Lisa van Staden, Herman Kamper
- Abstract summary: We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many speech processing tasks involve measuring the acoustic similarity
between speech segments. Acoustic word embeddings (AWE) allow for efficient
comparisons by mapping speech segments of arbitrary duration to
fixed-dimensional vectors. For zero-resource speech processing, where
unlabelled speech is the only available resource, some of the best AWE
approaches rely on weak top-down constraints in the form of automatically
discovered word-like segments. Rather than learning embeddings at the segment
level, another line of zero-resource research has looked at representation
learning at the short-time frame level. Recent approaches include
self-supervised predictive coding and correspondence autoencoder (CAE) models.
In this paper we consider whether these frame-level features are beneficial
when used as inputs for training an unsupervised AWE model. We compare
frame-level features from contrastive predictive coding (CPC), autoregressive
predictive coding and a CAE to conventional MFCCs. These are used as inputs to
a recurrent CAE-based AWE model. In a word discrimination task on English and
Xitsonga data, all three representation learning approaches outperform MFCCs,
with CPC consistently showing the biggest improvement. In cross-lingual
experiments we find that CPC features trained on English can also be
transferred to Xitsonga.
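To make the model concrete, here is a minimal PyTorch sketch of a recurrent correspondence-autoencoder (CAE-RNN) acoustic word embedding model of the kind the abstract describes: an encoder RNN maps a variable-length segment of frame-level features to a fixed-dimensional vector, and a decoder RNN reconstructs the matching segment of a discovered word pair from that vector. All names, dimensionalities and the equal-length pairing below are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a CAE-RNN acoustic word embedding model (illustrative only).
import torch
import torch.nn as nn

class CAERNN(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=256, embed_dim=130):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_frames = nn.Linear(hidden_dim, feat_dim)

    def embed(self, segment):
        # segment: (batch, n_frames, feat_dim); last hidden state -> embedding
        _, h = self.encoder(segment)
        return self.to_embedding(h[-1])              # (batch, embed_dim)

    def forward(self, segment_in, n_out_frames):
        z = self.embed(segment_in)
        # Condition every decoder step on the fixed-dimensional embedding.
        z_rep = z.unsqueeze(1).expand(-1, n_out_frames, -1)
        out, _ = self.decoder(z_rep)
        return self.to_frames(out)                   # (batch, n_out_frames, feat_dim)

model = CAERNN()
x_in = torch.randn(8, 60, 13)   # one word of a discovered pair
x_out = torch.randn(8, 60, 13)  # the other word (same assumed type)
loss = nn.functional.mse_loss(model(x_in, x_out.shape[1]), x_out)
```

Swapping MFCC inputs for CPC, APC or CAE frame features only changes `feat_dim` and the feature pipeline; the AWE model itself is unchanged, which is what keeps the paper's input-feature comparison clean.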
Related papers
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- Unimodal Aggregation for CTC-based Speech Recognition
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
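As a rough illustration of the aggregation step in the UMA entry above, frames assigned to the same token can be collapsed by a weighted average over detected segments. This is a generic sketch of that idea, not UMA's exact formulation; `weights` and `boundaries` are assumed to come from the learned detector.

```python
# Hedged sketch: weighted pooling of frames within detected token segments.
import torch

def aggregate(frames, weights, boundaries):
    """frames: (T, d); weights: (T,); boundaries: list of (start, end)."""
    segments = []
    for s, e in boundaries:
        w = weights[s:e].unsqueeze(1)                 # (e - s, 1)
        segments.append((w * frames[s:e]).sum(0) / (w.sum() + 1e-8))
    return torch.stack(segments)                      # (n_segments, d)

out = aggregate(torch.randn(50, 80), torch.rand(50), [(0, 20), (20, 35), (35, 50)])
print(out.shape)  # torch.Size([3, 80]): shorter sequence, one vector per token
```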
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation is available, only a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
We consider hate speech detection through keyword spotting on radio broadcasts.
One approach is to build an automatic speech recognition system for the target low-resource language.
We compare this to using acoustic word embedding models that map speech segments to a space where matching words have similar vectors.
arXiv Detail & Related papers (2023-06-01T07:25:10Z)
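A minimal sketch of the AWE-based keyword spotting route from the hate speech detection entry above: query keywords and unlabelled search segments are embedded into the same fixed-dimensional space, and a segment is flagged when its cosine similarity to any query exceeds a threshold. The random vectors below are stand-ins for the outputs of a trained AWE model.

```python
# Query-by-example keyword spotting with acoustic word embeddings (sketch).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def spot_keywords(query_embs, segment_embs, threshold=0.7):
    """Return (index, score) for segments matching any query keyword."""
    hits = []
    for i, seg in enumerate(segment_embs):
        score = max(cosine(q, seg) for q in query_embs)
        if score >= threshold:
            hits.append((i, score))
    return hits

rng = np.random.default_rng(0)
queries = [rng.normal(size=128) for _ in range(3)]     # embedded keywords
segments = [rng.normal(size=128) for _ in range(100)]  # embedded search segments
print(spot_keywords(queries, segments))
```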
- Learning Context-aware Classifier for Semantic Segmentation
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
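To illustrate the vector quantization named in the VQMIVC entry above, here is a generic VQ-VAE-style quantizer with a straight-through gradient estimator. It sketches the general technique only, not VQMIVC's exact module, and the mutual information term is omitted.

```python
# Generic VQ content-encoding step (sketch, not VQMIVC's implementation).
import torch

def vector_quantize(z, codebook):
    """z: (batch, n_frames, dim); codebook: (n_codes, dim)."""
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.shape[0], -1, -1))
    codes = dists.argmin(dim=-1)               # nearest codebook entry per frame
    quantized = codebook[codes]                # (batch, n_frames, dim)
    st = z + (quantized - z).detach()          # straight-through: grads bypass argmin
    return st, quantized, codes

codebook = torch.randn(512, 64, requires_grad=True)
z = torch.randn(4, 100, 64, requires_grad=True)            # content encoder output
st, quantized, codes = vector_quantize(z, codebook)
commitment = torch.mean((z - quantized.detach()) ** 2)     # pulls encoder to codes
codebook_loss = torch.mean((z.detach() - quantized) ** 2)  # moves codes to encoder
```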
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
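The NCE objective in the SCPC entry above, like the CPC features in the main paper, follows the InfoNCE pattern: a predicted representation should score higher against its true target than against sampled negatives. A generic sketch, where the cosine scoring and temperature value are illustrative choices:

```python
# InfoNCE-style contrastive loss (generic sketch of the CPC/SCPC objective).
import torch
import torch.nn.functional as F

def info_nce(pred, target, negatives, temperature=0.1):
    """pred, target: (batch, dim); negatives: (batch, n_neg, dim)."""
    pos = F.cosine_similarity(pred, target, dim=-1)                  # (batch,)
    neg = F.cosine_similarity(pred.unsqueeze(1), negatives, dim=-1)  # (batch, n_neg)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature
    labels = torch.zeros(logits.shape[0], dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 10, 64))
```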
- Unsupervised feature learning for speech using correspondence and Siamese networks
We compare two recent methods for frame-level acoustic feature learning.
For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type.
For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs.
For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs.
arXiv Detail & Related papers (2020-03-28T14:31:01Z)
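For the correspondence autoencoder in this last entry, training can be pictured as frame-in, matching-frame-out regression. A minimal sketch assuming DTW-aligned frame pairs obtained from unsupervised term discovery; the layer sizes are arbitrary.

```python
# Frame-level correspondence autoencoder training step (sketch).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(39, 100), nn.ReLU())
decoder = nn.Linear(100, 39)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

x = torch.randn(256, 39)  # input frames (e.g. MFCCs with deltas)
y = torch.randn(256, 39)  # DTW-aligned frames from the matching word
loss = nn.functional.mse_loss(decoder(encoder(x)), y)
opt.zero_grad()
loss.backward()
opt.step()
# After training, encoder(x) is extracted as the learned frame-level feature.
```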
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.