Pitch Accent Detection improves Pretrained Automatic Speech Recognition
- URL: http://arxiv.org/abs/2508.04814v1
- Date: Wed, 06 Aug 2025 18:52:05 GMT
- Title: Pitch Accent Detection improves Pretrained Automatic Speech Recognition
- Authors: David Sasu, Natalie Schluter
- Abstract summary: The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task. We show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
- Score: 2.5322020135765464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, joint training decreases ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
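No implementation accompanies this abstract, so the following is only a minimal sketch of what the described joint training could look like: a shared pretrained speech encoder feeding an ASR head and a framewise pitch accent classification head, trained with a weighted sum of the two losses. The module names, the choice of a CTC objective, the framewise tagging formulation, and the weight `alpha` are all illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (PyTorch) of joint ASR + pitch accent detection.
# All names, dimensions, and the loss weight `alpha` are assumptions;
# the paper's actual architecture and objectives may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointASRPitchAccent(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int,
                 vocab_size: int, num_accent_classes: int = 2):
        super().__init__()
        self.encoder = encoder  # e.g. a pretrained semi-supervised speech encoder
        self.asr_head = nn.Linear(hidden_dim, vocab_size)             # per-frame CTC logits
        self.accent_head = nn.Linear(hidden_dim, num_accent_classes)  # per-frame accent tags

    def forward(self, speech: torch.Tensor):
        feats = self.encoder(speech)  # (batch, time, hidden_dim)
        return self.asr_head(feats), self.accent_head(feats)

def joint_loss(asr_logits, accent_logits, transcripts, feat_lens,
               transcript_lens, accent_targets, alpha: float = 0.5):
    # CTC loss for transcription plus weighted framewise cross-entropy
    # for pitch accent detection.
    log_probs = F.log_softmax(asr_logits, dim=-1).transpose(0, 1)  # (time, batch, vocab)
    ctc = F.ctc_loss(log_probs, transcripts, feat_lens, transcript_lens)
    accent = F.cross_entropy(accent_logits.transpose(1, 2), accent_targets)
    return ctc + alpha * accent
```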
Related papers
- Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models [4.917936997225074]
This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM). It introduces a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR.
arXiv Detail & Related papers (2025-06-16T09:42:05Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
This paper proposes an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks (see the cross-attention sketch after this list).
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Device Directedness with Contextual Cues for Spoken Dialog Systems [15.96415881820669]
We define barge-in verification as a supervised learning task where audio-only information is used to classify user-spoken dialogue into true and false barge-ins.
We use low-level speech representations from a self-supervised representation learning model for our downstream classification task.
We propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training.
arXiv Detail & Related papers (2022-11-23T19:49:11Z)
- Improved far-field speech recognition using Joint Variational Autoencoder [5.320201231911981]
We propose mapping speech features from far-field to close-talk using a denoising autoencoder (DA); see the sketch after this list.
Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement, and of 3.96% compared to an acoustic model (AM) trained directly on far-field filterbank features.
arXiv Detail & Related papers (2022-04-24T14:14:04Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
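Two entries above, the accent-specific codebook pre-training paper and the accented-ASR codebook paper, both hinge on cross-attention between encoder features and a trainable set of codebooks. Below is one plausible minimal form of that mechanism; the codebook size, head count, and residual wiring are assumptions, not details taken from either paper.

```python
# Minimal sketch (PyTorch) of cross-attention over a trainable accent
# codebook. Codebook size, head count, and the residual connection are
# illustrative assumptions, not details from the papers above.
import torch
import torch.nn as nn

class CodebookCrossAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_codes: int = 64, num_heads: int = 4):
        super().__init__()
        # Learnable entries intended to capture accent-specific information.
        self.codebook = nn.Parameter(torch.randn(num_codes, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, hidden_dim) encoder outputs act as queries;
        # the shared codebook supplies keys and values.
        codes = self.codebook.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.attn(query=feats, key=codes, value=codes)
        return feats + attended  # residual back into the encoder stream
```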
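The far-field entry above proposes mapping far-field speech features toward close-talk features with a denoising autoencoder (DA). A minimal sketch of that idea, with assumed layer sizes and a plain MSE reconstruction objective against parallel close-talk features, follows.

```python
# Minimal sketch (PyTorch) of a denoising autoencoder that maps
# far-field filterbank features toward parallel close-talk features.
# Layer sizes and the MSE objective are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureMappingDAE(nn.Module):
    def __init__(self, feat_dim: int = 80, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Linear(bottleneck, feat_dim)

    def forward(self, farfield: torch.Tensor) -> torch.Tensor:
        # farfield: (batch, time, feat_dim) filterbank features.
        return self.decoder(self.encoder(farfield))

# Training pairs far-field inputs with time-aligned close-talk targets:
#   loss = torch.nn.functional.mse_loss(model(farfield_feats), closetalk_feats)
```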