Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
- URL: http://arxiv.org/abs/2007.00183v2
- Date: Tue, 24 Nov 2020 17:03:52 GMT
- Title: Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
- Authors: Bowen Shi, Shane Settle, Karen Livescu
- Abstract summary: We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
- Score: 28.04666950237383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmental models are sequence prediction models in which scores of hypotheses
are based on entire variable-length segments of frames. We consider segmental
models for whole-word ("acoustic-to-word") speech recognition, with the feature
vectors defined using vector embeddings of segments. Such models are
computationally challenging as the number of paths is proportional to the
vocabulary size, which can be orders of magnitude larger than when using
subword units like phones. We describe an efficient approach for end-to-end
whole-word segmental models, with forward-backward and Viterbi decoding
performed on a GPU and a simple segment scoring function that reduces space
complexity. In addition, we investigate the use of pre-training via jointly
trained acoustic word embeddings (AWEs) and acoustically grounded word
embeddings (AGWEs) of written word labels. We find that word error rate can be
reduced by a large margin by pre-training the acoustic segment representation
with AWEs, and additional (smaller) gains can be obtained by pre-training the
word prediction layer with AGWEs. Our final models improve over prior A2W
models.
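The segmental decoding described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes a mean-pooled segment embedding scored by a dot product against word embeddings (names and pooling choice are assumptions), and it caps segment length to keep the search tractable.

```python
# Hypothetical sketch of segmental Viterbi decoding. Segment score =
# dot product between a mean-pooled acoustic segment embedding and a
# word embedding; all names and the pooling choice are assumed.
import numpy as np

def viterbi_segment(frames, word_emb, max_len=4):
    """frames: (T, d) frame features; word_emb: (V, d) word embeddings.

    Returns a list of (start, end, word_index) segments covering [0, T).
    """
    T = frames.shape[0]
    alpha = np.full(T + 1, -np.inf)   # best score of a segmentation ending at t
    alpha[0] = 0.0
    back = [None] * (T + 1)           # (start, word) backpointers
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            seg = frames[s:t].mean(axis=0)   # simple mean-pooled segment embedding
            scores = word_emb @ seg          # one score per vocabulary word
            w = int(np.argmax(scores))
            cand = alpha[s] + scores[w]
            if cand > alpha[t]:
                alpha[t], back[t] = cand, (s, w)
    # trace back the best segmentation
    path, t = [], T
    while t > 0:
        s, w = back[t]
        path.append((s, t, w))
        t = s
    return path[::-1]
```

The inner loop scores every vocabulary word for every candidate segment, which is why whole-word segmental models are expensive: the number of paths grows with the vocabulary size, and the simple pooled scoring function above is one way to keep the memory footprint small.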
Related papers
- Revisiting speech segmentation and lexicon learning with better features [29.268728666438495]
We revisit a self-supervised method that segments unlabelled speech into word-like segments.
We start from the two-stage duration-penalised dynamic programming method.
In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT.
After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features.
arXiv Detail & Related papers (2024-01-31T15:06:34Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]

We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, performance on the new words improves as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly.
This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units.
It specifically describes a duration-penalised dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z)
- On the Difficulty of Segmenting Words with Attention [32.97060026226872]
We show, however, that even on monolingual data this approach is brittle.
In experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task.
arXiv Detail & Related papers (2021-09-21T11:37:08Z)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g. at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via noise contrastive estimation (NCE).
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals of distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
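Several of the papers above build acoustic word embeddings by pooling frame-level features over a segment, e.g. averaging HuBERT features per discovered word. A minimal sketch of that idea, with cosine similarity for comparing the resulting embeddings (the function names are assumptions, not any paper's API):

```python
# Minimal sketch (assumed, not from any specific paper): an acoustic
# word embedding obtained by mean-pooling frame-level features over a
# segment, compared with cosine similarity.
import numpy as np

def acoustic_word_embedding(frame_feats):
    """frame_feats: (n_frames, d) array of frame features for one segment."""
    return frame_feats.mean(axis=0)

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Mean pooling yields a fixed-dimensional vector regardless of segment duration, which is what makes nearest-neighbour comparison across variable-length word segments possible.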
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.