HARNESS: Lightweight Distilled Arabic Speech Foundation Models
- URL: http://arxiv.org/abs/2509.14689v1
- Date: Thu, 18 Sep 2025 07:30:37 GMT
- Title: HARNESS: Lightweight Distilled Arabic Speech Foundation Models
- Authors: Vrunda N. Sukhadia, Shammur Absar Chowdhury
- Abstract summary: We introduce HArnESS, the first Arabic-centric self-supervised speech model family. We train large bilingual HArnESS (HL) SSL models and distill knowledge into compressed student models (HS, HST). We use low-rank approximation to further compact the teacher's discrete supervision into shallow, thin models.
- Score: 15.43303906288125
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large pre-trained speech models excel in downstream tasks but their deployment is impractical for resource-limited environments. In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture Arabic speech nuances. Using iterative self-distillation, we train large bilingual HArnESS (HL) SSL models and then distill knowledge into compressed student models (HS, HST), preserving Arabic-specific representations. We use low-rank approximation to further compact the teacher's discrete supervision into shallow, thin models. We evaluate HArnESS on Arabic ASR, Speaker Emotion Recognition (SER), and Dialect Identification (DID), demonstrating effectiveness against HuBERT and XLS-R. With minimal fine-tuning, HArnESS achieves SOTA or comparable performance, making it a lightweight yet powerful alternative for real-world use. We release our distilled models and findings to support responsible research and deployment in low-resource settings.
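The abstract does not spell out how the teacher's discrete supervision is compacted; as a rough illustration of low-rank approximation in general, a truncated SVD is the standard way to replace a large matrix with two thin factors holding far fewer parameters (the matrix shapes and rank below are illustrative, not taken from the paper):

```python
import numpy as np

def low_rank_approx(W, rank):
    # Truncated SVD: keep only the top-`rank` singular components of W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 256))   # stand-in for a teacher projection matrix
W_approx = low_rank_approx(W, rank=64)

# Storing the two rank-64 factors is far cheaper than storing W itself:
full_params = W.size                   # 768 * 256 = 196,608
factored_params = 64 * (768 + 256)     # U_r and V_r = 65,536
```

Keeping more singular components always tightens the reconstruction, so the rank trades model size against fidelity to the teacher.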
Related papers
- Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning [0.0]
We employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA).
arXiv Detail & Related papers (2025-04-16T17:05:14Z) - How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. We extend a conventional efficient fine-tuning scheme based on the adapter to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z) - Self-supervised Speech Representations Still Struggle with African American Vernacular English [28.223877889211803]
Underperformance of ASR systems for speakers of marginalized language varieties is a well-documented phenomenon.
We investigate whether or not the recent wave of Self-Supervised Learning speech models can close the gap in ASR performance between AAVE and Mainstream American English.
arXiv Detail & Related papers (2024-08-26T13:29:25Z) - Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models [69.59613095232598]
We propose adaptation methods which integrate LoRA into existing SSL models to extend them to new languages. We also develop preservation strategies, including data combination and re-clustering, to retain abilities on existing languages.
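LoRA, as used in the abstract above, adds a trainable low-rank update alongside each frozen pre-trained weight. A minimal sketch of the idea (shapes, rank, and scaling are illustrative; real implementations wrap framework layers rather than raw matrices):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a trainable low-rank update (LoRA-style)."""

    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen weight, shape (out, in)
        self.A = rng.standard_normal((rank, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], rank))             # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + scale * B A x ; only A and B would receive gradients.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and only the small A and B factors need to be stored per new language.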
arXiv Detail & Related papers (2024-06-20T08:13:30Z) - Efficient Compression of Multitask Multilingual Speech Models [0.0]
DistilWhisper is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
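Knowledge distillation of the kind described above is commonly trained with a KL-divergence loss between temperature-softened teacher and student output distributions; the sketch below shows that generic loss, not DistilWhisper's exact training objective:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax, computed stably.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # as in Hinton et al.'s classic formulation.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge, pushing the small model toward the large model's behaviour.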
arXiv Detail & Related papers (2024-05-02T03:11:59Z) - Arabic Tweet Act: A Weighted Ensemble Pre-Trained Transformer Model for
Classifying Arabic Speech Acts on Twitter [0.32885740436059047]
This paper proposes a Twitter dialectal Arabic speech act classification approach based on a transformer deep learning neural network.
We propose a BERT-based weighted ensemble learning approach to integrate the advantages of various BERT models in dialectal Arabic speech act classification.
The results show that the best model is araBERTv2-Twitter, with a macro-averaged F1 score of 0.73 and an accuracy of 0.84.
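A weighted ensemble of classifiers typically blends each model's class-probability vector with normalized weights; a small sketch with made-up models and weights:

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    # Normalize the weights, then average the models' class-probability
    # vectors; the argmax of the blend is the ensemble prediction.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(p, dtype=float) for wi, p in zip(w, prob_list))

# Two hypothetical models scoring three speech-act classes:
m1 = [0.6, 0.3, 0.1]
m2 = [0.2, 0.5, 0.3]
blend = weighted_ensemble([m1, m2], weights=[2, 1])
```

Giving the stronger model a larger weight lets it dominate ties while the weaker model can still tip close calls.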
arXiv Detail & Related papers (2024-01-30T19:01:24Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model [16.31307448314024]
We propose DistilXLSR, a distilled cross-lingual speech representation model.
By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data.
Our method is proven to be generalizable to various languages/teacher models and has the potential to improve the cross-lingual performance of the English pre-trained models.
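The phoneme-shuffling idea above can be illustrated by permuting segments of a signal so that its overall statistics survive while the linguistic ordering is destroyed; the sketch below uses fixed-length chunks as a crude stand-in for true phoneme-aligned segments:

```python
import numpy as np

def shuffle_segments(signal, seg_len, seed=0):
    # Split the waveform into fixed-length chunks (a crude stand-in for
    # phoneme segments) and shuffle their order to scramble linguistic content.
    n = len(signal) // seg_len
    segs = [signal[i * seg_len:(i + 1) * seg_len] for i in range(n)]
    rng = np.random.default_rng(seed)
    rng.shuffle(segs)
    tail = signal[n * seg_len:]
    return np.concatenate(segs + ([tail] if len(tail) else []))
```

The output contains exactly the same samples as the input, only reordered at the segment level, which is what makes it usable as language-reduced training material.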
arXiv Detail & Related papers (2023-06-02T07:03:06Z) - ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [94.64616634862995]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
arXiv Detail & Related papers (2023-05-18T00:01:27Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.