CTC-DID: CTC-Based Arabic dialect identification for streaming applications
- URL: http://arxiv.org/abs/2601.12199v1
- Date: Sun, 18 Jan 2026 00:11:02 GMT
- Title: CTC-DID: CTC-Based Arabic dialect identification for streaming applications
- Authors: Muhammad Umar Farooq, Oscar Saz
- Abstract summary: CTC-DID frames the dialect identification task as a limited-vocabulary ASR system. An SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). CTC-DID frames the dialect identification task as a limited-vocabulary ASR system, where dialect tags are treated as a sequence of labels for a given utterance. For training, the repetition of dialect tags in transcriptions is estimated either using a proposed Language-Agnostic Heuristic (LAH) approach or a pre-trained ASR model. The method is evaluated on the low-resource Arabic Dialect Identification (ADI) task, with experimental results demonstrating that an SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. Notably, CTC-DID also surpasses these models in zero-shot evaluation on the Casablanca dataset. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications, with minimal performance degradation.
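To make the core idea concrete, the sketch below applies the standard CTC collapse rule (merge consecutive repeats, then drop blanks) to per-frame predictions over a dialect-tag vocabulary, and reduces the decoded tag sequence to an utterance-level decision by majority vote. This is a minimal illustration of treating DID as limited-vocabulary ASR; the tag names, frame sequence, and majority-vote readout are illustrative assumptions, not details taken from the paper.

```python
# Sketch: CTC-style greedy decoding over a dialect-tag vocabulary.
# Tag names ("EGY", "MSA") and the frame sequence are hypothetical.

BLANK = "<blank>"

def ctc_greedy_decode(frame_tags):
    """Standard CTC collapse: merge consecutive repeats, drop blanks."""
    decoded, prev = [], None
    for tag in frame_tags:
        if tag != prev and tag != BLANK:
            decoded.append(tag)
        prev = tag
    return decoded

def streaming_did(frame_tags):
    """Utterance-level dialect label via majority vote over decoded tags."""
    tags = ctc_greedy_decode(frame_tags)
    if not tags:
        return None
    return max(set(tags), key=tags.count)

# Per-frame argmax outputs from a hypothetical CTC-DID model:
frames = ["<blank>", "EGY", "EGY", "<blank>", "EGY", "MSA", "<blank>", "EGY"]
print(ctc_greedy_decode(frames))  # ['EGY', 'EGY', 'MSA', 'EGY']
print(streaming_did(frames))      # EGY
```

Because the decode rule is causal over frames, the same loop can run incrementally on a stream, updating the running label as frames arrive, which matches the streaming use case the abstract describes.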
Related papers
- Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech [51.14752758616364]
Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. We propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
arXiv Detail & Related papers (2025-10-05T09:32:12Z)
- Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition [8.775527128005136]
This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Speaker-Distinguishable CTC (SD-CTC) is an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
arXiv Detail & Related papers (2025-06-09T07:43:43Z)
- Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models [5.576934300567641]
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models.
The proposed approach identifies challenging segments of an audio stream and improves the model's recognition of them.
arXiv Detail & Related papers (2024-08-23T11:54:25Z)
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding [1.07288078404291]
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- Low-resource speech recognition and dialect identification of Irish in a multi-task framework [7.981589711420179]
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (Inter CTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID).
Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN).
arXiv Detail & Related papers (2024-05-02T13:54:39Z)
- Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition [5.3545957730615905]
We introduce language identification information into the middle layer of the ASR model's encoder.
We aim to generate acoustic features that imply language distinctions in a more implicit way, reducing the model's confusion when dealing with language switching.
arXiv Detail & Related papers (2023-12-15T07:46:35Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition [51.36957172200015]
We study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition.
Even when one million lines of training sentences are used to train the DBLSTM, an explicit language model is still helpful.
arXiv Detail & Related papers (2020-07-31T08:23:37Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension, the hybrid CTC/attention architecture.
We use the CTC label outputs as a cue for detecting speech segments with simple thresholding.
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced by 14% relative with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.