Pitch Accent Detection improves Pretrained Automatic Speech Recognition
- URL: http://arxiv.org/abs/2508.04814v1
- Date: Wed, 06 Aug 2025 18:52:05 GMT
- Title: Pitch Accent Detection improves Pretrained Automatic Speech Recognition
- Authors: David Sasu, Natalie Schluter
- Abstract summary: The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task. We show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
- Score: 2.5322020135765464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, joint training decreases ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
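No implementation accompanies this abstract, so the following is only a minimal sketch of what the described joint training could look like: a shared pretrained speech encoder feeding an ASR head and a framewise pitch accent classification head, trained with a weighted sum of the two losses. The module names, the choice of a CTC objective, the framewise tagging formulation, and the weight `alpha` are all illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (PyTorch) of joint ASR + pitch accent detection.
# All names, dimensions, and the loss weight `alpha` are assumptions;
# the paper's actual architecture and objectives may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointASRPitchAccent(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int,
                 vocab_size: int, num_accent_classes: int = 2):
        super().__init__()
        self.encoder = encoder  # e.g. a pretrained semi-supervised speech encoder
        self.asr_head = nn.Linear(hidden_dim, vocab_size)             # per-frame CTC logits
        self.accent_head = nn.Linear(hidden_dim, num_accent_classes)  # per-frame accent tags

    def forward(self, speech: torch.Tensor):
        feats = self.encoder(speech)  # (batch, time, hidden_dim)
        return self.asr_head(feats), self.accent_head(feats)

def joint_loss(asr_logits, accent_logits, transcripts, feat_lens,
               transcript_lens, accent_targets, alpha: float = 0.5):
    # CTC loss for transcription plus weighted framewise cross-entropy
    # for pitch accent detection.
    log_probs = F.log_softmax(asr_logits, dim=-1).transpose(0, 1)  # (time, batch, vocab)
    ctc = F.ctc_loss(log_probs, transcripts, feat_lens, transcript_lens)
    accent = F.cross_entropy(accent_logits.transpose(1, 2), accent_targets)
    return ctc + alpha * accent
```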
Related papers
- Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models [4.917936997225074]
This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM). It introduces a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR.
arXiv Detail & Related papers (2025-06-16T09:42:05Z)
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
This paper proposes an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks (see the cross-attention sketch after this list).
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Device Directedness with Contextual Cues for Spoken Dialog Systems [15.96415881820669]
We define barge-in verification as a supervised learning task where audio-only information is used to classify user-spoken dialogue into true and false barge-ins.
We use low-level speech representations from a self-supervised representation learning model for our downstream classification task.
We propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training.
arXiv Detail & Related papers (2022-11-23T19:49:11Z)
- Improved far-field speech recognition using Joint Variational Autoencoder [5.320201231911981]
We propose mapping speech features from far-field to close-talk using a denoising autoencoder (DA); see the sketch after this list.
Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA-based enhancement, and of 3.96% compared to an acoustic model (AM) trained directly on far-field filterbank features.
arXiv Detail & Related papers (2022-04-24T14:14:04Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
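Two entries above, the accent-specific codebook pre-training paper and the accented-ASR codebook paper, both hinge on cross-attention between encoder features and a trainable set of codebooks. Below is one plausible minimal form of that mechanism; the codebook size, head count, and residual wiring are assumptions, not details taken from either paper.

```python
# Minimal sketch (PyTorch) of cross-attention over a trainable accent
# codebook. Codebook size, head count, and the residual connection are
# illustrative assumptions, not details from the papers above.
import torch
import torch.nn as nn

class CodebookCrossAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_codes: int = 64, num_heads: int = 4):
        super().__init__()
        # Learnable entries intended to capture accent-specific information.
        self.codebook = nn.Parameter(torch.randn(num_codes, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, hidden_dim) encoder outputs act as queries;
        # the shared codebook supplies keys and values.
        codes = self.codebook.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.attn(query=feats, key=codes, value=codes)
        return feats + attended  # residual back into the encoder stream
```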
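The far-field entry above proposes mapping far-field speech features toward close-talk features with a denoising autoencoder (DA). A minimal sketch of that idea, with assumed layer sizes and a plain MSE reconstruction objective against parallel close-talk features, follows.

```python
# Minimal sketch (PyTorch) of a denoising autoencoder that maps
# far-field filterbank features toward parallel close-talk features.
# Layer sizes and the MSE objective are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureMappingDAE(nn.Module):
    def __init__(self, feat_dim: int = 80, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Linear(bottleneck, feat_dim)

    def forward(self, farfield: torch.Tensor) -> torch.Tensor:
        # farfield: (batch, time, feat_dim) filterbank features.
        return self.decoder(self.encoder(farfield))

# Training pairs far-field inputs with time-aligned close-talk targets:
#   loss = torch.nn.functional.mse_loss(model(farfield_feats), closetalk_feats)
```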