Related papers: Low-resource speech recognition and dialect identification of Irish in a multi-task framework

URL: http://arxiv.org/abs/2405.01293v1
Date: Thu, 2 May 2024 13:54:39 GMT
Title: Low-resource speech recognition and dialect identification of Irish in a multi-task framework
Authors: Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide
Abstract summary: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN).
Score: 7.981589711420179
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder and the performance of both architectures is compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.

Related papers

CTC-DID: CTC-Based Arabic dialect identification for streaming applications [5.548949834680307]
CTC-DID frames the dialect identification task as a limited-vocabulary ASR system. An SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications.
arXiv Detail & Related papers (2026-01-18T00:11:02Z)
Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [53.16213723669751]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation based semantic communication framework.
arXiv Detail & Related papers (2025-08-04T07:47:18Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores [14.150602045545108]
We propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process.
arXiv Detail & Related papers (2024-06-06T07:39:17Z)
Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations. The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
Improving Massively Multilingual ASR With Auxiliary CTC Objectives [40.10307386370194]
We introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark. We investigate techniques inspired by recent Connectionist Temporal Classification (CTC) studies to help the model handle the large number of languages. Our state-of-the-art systems using self-supervised models with the Conformer architecture improve over the results of prior work on FLEURS by a relative 28.4% CER.
arXiv Detail & Related papers (2023-02-24T18:59:51Z)
Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification [14.197869575012925]
We propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER
arXiv Detail & Related papers (2022-09-13T15:10:41Z)
LegoNN: Building Modular Encoder-Decoder Models [117.47858131603112]
State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts. We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for fine-tuning.
arXiv Detail & Related papers (2022-06-07T14:08:07Z)
Integrate Lattice-Free MMI into End-to-End Speech Recognition [87.01137882072322]
In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. With this motivation, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. Previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. In this work, novel algorithms are proposed to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI) into E2E
arXiv Detail & Related papers (2022-03-29T14:32:46Z)
Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z)
Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective. We evaluate the proposed method on various corpora, reaching word error rate (WER) 9.9% on the WSJ corpus and character error rate (CER) 5.2% on the AISHELL-1 corpus respectively.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.