PSST! Prosodic Speech Segmentation with Transformers
- URL: http://arxiv.org/abs/2302.01984v1
- Date: Fri, 3 Feb 2023 20:09:17 GMT
- Title: PSST! Prosodic Speech Segmentation with Transformers
- Authors: Nathan Roll, Calbert Graham, Simon Todd
- Abstract summary: We finetune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens.
Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data.
- Score: 1.3535770763481905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-attention mechanisms have enabled transformers to achieve
superhuman-level performance on many speech-to-text (STT) tasks, yet the
challenge of automatic prosodic segmentation has remained unsolved. In this
paper we finetune Whisper, a pretrained STT model, to annotate intonation unit
(IU) boundaries by repurposing low-frequency tokens. Our approach achieves an
accuracy of 95.8%, outperforming previous methods without the need for
large-scale labeled data or enterprise-grade compute resources. We also
diminish input signals by applying a series of filters, finding that low-pass
filters with a 3.2 kHz cutoff improve segmentation performance in out-of-sample
and out-of-distribution contexts. We release our model as both a transcription
tool and a baseline for further improvements in prosodic segmentation.
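
The abstract points to two concrete, reproducible steps: inserting a repurposed low-frequency token at intonation unit (IU) boundaries in the fine-tuning targets, and low-pass filtering the input audio at 3.2 kHz. The Python sketch below illustrates both under stated assumptions; the boundary symbol, filter order, and data layout are illustrative choices, not details taken from the paper or its released code.

# Minimal sketch, not the authors' implementation.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_3200(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # 8th-order Butterworth low-pass filter with a 3.2 kHz cutoff,
    # applied forward and backward to avoid phase distortion.
    sos = butter(N=8, Wn=3200, btype="lowpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, waveform)

IU_BOUNDARY = "%"  # hypothetical low-frequency symbol standing in for the repurposed token

def mark_iu_boundaries(units: list[str]) -> str:
    # Join hand-segmented intonation units into a single fine-tuning target
    # string in which the boundary marker separates consecutive IUs.
    return f" {IU_BOUNDARY} ".join(units)

# Example target string the fine-tuned model would learn to emit:
print(mark_iu_boundaries(["and then we left", "because it was late", "you know"]))
# -> "and then we left % because it was late % you know"

At inference time, any boundary markers emitted in the transcript can be read off directly as predicted IU boundaries.
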
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation [70.87670058323239]
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information.
One of the most important directions is to input the whole document directly to the standard Transformer model.
In this work, we maintain translation performance while gaining a 20% speed-up by introducing an extra selection layer based on lightweight attention, which selects a small portion of tokens to be attended to.
arXiv Detail & Related papers (2023-09-25T14:33:47Z)
- ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation [21.335983674309475]
Diffusion models suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation.
We introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query.
We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation to a latent space.
arXiv Detail & Related papers (2023-09-19T16:36:33Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss [15.304831835680847]
We propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance.
Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy.
arXiv Detail & Related papers (2022-11-11T21:09:58Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition [28.06475768075206]
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers.
Previous work on the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) has shown a large real-time factor (RTF) improvement over autoregressive transformers (AT).
We propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses.
arXiv Detail & Related papers (2021-06-18T02:58:30Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that, with limited data far less than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- Weak-Attention Suppression For Transformer Based Speech Recognition [33.30436927415777]
We propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities.
We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines.
arXiv Detail & Related papers (2020-05-18T23:49:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.