Acoustic Neighbor Embeddings
- URL: http://arxiv.org/abs/2007.10329v5
- Date: Thu, 6 Jan 2022 23:14:11 GMT
- Title: Acoustic Neighbor Embeddings
- Authors: Woojay Jeon
- Abstract summary: This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings.
The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences.
The recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding on test data with up to 1 million names in the vocabulary and 40-dimensional embeddings.
- Score: 2.842794675894731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel acoustic word embedding called Acoustic Neighbor
Embeddings, where speech or text of arbitrary length is mapped to a vector
space of fixed, reduced dimensions by adapting stochastic neighbor embedding
(SNE) to sequential inputs. The Euclidean distance between coordinates in the
embedding space reflects the phonetic confusability between their corresponding
sequences. Two encoder neural networks are trained: an acoustic encoder that
accepts speech signals in the form of frame-wise subword posterior
probabilities obtained from an acoustic model and a text encoder that accepts
text in the form of subword transcriptions. Compared to a triplet loss
criterion, the proposed method is shown to have more effective gradients for
neural network training. Experimentally, it also gives more accurate results
with low-dimensional embeddings when the two encoder networks are used in
tandem in a word (name) recognition task, and when the text encoder network is
used standalone in an approximate phonetic matching task. In particular, in an
isolated name recognition task depending solely on Euclidean nearest-neighbor
search between the proposed embedding vectors, the recognition accuracy is
identical to that of conventional finite state transducer (FST)-based decoding
using test data with up to 1 million names in the vocabulary and 40 dimensions
in the embeddings.
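The paper adapts stochastic neighbor embedding to paired sequence encoders. Below is a minimal sketch of how such an SNE-style batch objective can look, assuming the two encoders output fixed-dimension vectors; the function name and the exact form of the loss are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def sne_style_loss(acoustic_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative SNE-style batch loss (not the paper's exact formulation).

    acoustic_emb: (B, D) outputs of the acoustic encoder.
    text_emb:     (B, D) outputs of the text encoder; row i is the
                  transcription paired with acoustic row i.
    """
    # Squared Euclidean distance between every acoustic/text pair: (B, B).
    d2 = torch.cdist(acoustic_emb, text_emb).pow(2)
    # Neighbor probabilities q[i, j] = softmax_j(-d2[i, j]), as in SNE.
    log_q = F.log_softmax(-d2, dim=1)
    # The true neighbor of each utterance is its own transcription (diagonal).
    targets = torch.arange(acoustic_emb.size(0), device=acoustic_emb.device)
    return F.nll_loss(log_q, targets)
```

Because the softmax spreads probability mass over every in-batch alternative, each step receives gradient from all confusable pairs at once rather than from a single positive/negative pair, which is consistent with the abstract's claim of more effective gradients than a triplet criterion.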
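In the isolated name recognition setup, decoding then reduces to a Euclidean nearest-neighbor search over precomputed text embeddings. A sketch, assuming two hypothetical encoder callables (`acoustic_encoder`, `text_encoder`) that are not defined in the paper under these names:

```python
import numpy as np

def build_name_index(names: list, text_encoder) -> np.ndarray:
    """Embed every vocabulary name once, offline: returns a (V, D) matrix."""
    return np.stack([text_encoder(name) for name in names])

def recognize(posteriors: np.ndarray, name_index: np.ndarray, names: list,
              acoustic_encoder) -> str:
    """Embed frame-wise subword posteriors and return the nearest name."""
    query = acoustic_encoder(posteriors)                 # (D,) vector
    dists = np.linalg.norm(name_index - query, axis=1)   # distance to all V names
    return names[int(np.argmin(dists))]
```

At the reported scale (V = 1 million names, D = 40) the brute-force search above is a single 10^6-by-40 distance computation; an approximate nearest-neighbor index would be a natural substitute if latency mattered.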
Related papers
- Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding [5.697227044927832]
We propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder.
Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors.
Experimental results show that our scheme outperforms state-of-the-art results on the LibriPhrase hard dataset.
arXiv Detail & Related papers (2023-08-12T05:41:15Z)
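A rough illustration of the text-encoder idea described in the entry above; the `g2p` callable and the phoneme-vector table are stand-ins, and mean pooling is one plausible choice rather than the paper's actual method:

```python
import numpy as np

def embed_keyword(text: str, g2p, phoneme_vectors: dict) -> np.ndarray:
    """Illustrative audio-compliant text encoder: text -> phonemes -> embedding.

    g2p: any grapheme-to-phoneme callable (stand-in for the paper's G2P model).
    phoneme_vectors: one representative vector per phoneme symbol.
    """
    phonemes = [p for p in g2p(text) if p in phoneme_vectors]
    # Pool per-phoneme vectors into one fixed-size embedding (mean pooling is
    # an assumption here, not necessarily what the paper does).
    return np.mean([phoneme_vectors[p] for p in phonemes], axis=0)
```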
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
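A sketch of the general pseudo-language recipe behind the Wav2Seq entry above, assuming k-means cluster IDs as the discrete units; the paper's actual pipeline (including further compression of the unit sequences) differs in detail:

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

# Fit the discrete "alphabet" once on pooled frame features from the corpus.
corpus_frames = np.random.randn(10_000, 80)      # stand-in for real features
codebook = KMeans(n_clusters=25, n_init=10).fit(corpus_frames)

def pseudo_transcript(utterance_frames: np.ndarray) -> list:
    """Map one utterance's frames (T, D) to a compact pseudo-token sequence."""
    units = codebook.predict(utterance_frames)   # one cluster ID per frame
    return [int(k) for k, _ in groupby(units)]   # collapse consecutive repeats
```

An encoder-decoder model can then be pre-trained to "transcribe" audio into these pseudo tokens exactly as it would for real ASR labels.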
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments containing transients and stationary noise.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
- Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection [17.54377669932433]
We propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection.
We combine audio data in two languages for training instead of using only a single language.
arXiv Detail & Related papers (2020-05-24T15:27:56Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
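For reference, the triplet criterion used in the entry above (and compared against in the main abstract) has this general form; a generic sketch, not either paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Generic triplet loss: pull the positive to within `margin` of the
    anchor while pushing the negative away. Each step draws gradient from a
    single positive/negative pair, in contrast to the soft neighbor
    distribution sketched for the main paper."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```

PyTorch also ships this criterion as `torch.nn.TripletMarginLoss`.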
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.