Related papers: AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition

AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition

URL: http://arxiv.org/abs/2506.06566v1
Date: Fri, 06 Jun 2025 22:38:53 GMT
Title: AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
Authors: Chen Bao, Chuanbing Huo, Qinyu Chen, Chang Gao,
Abstract summary: AS-ASR is a lightweight aphasia-specific speech recognition framework based on Whisper-tiny.<n>Our approach systematically combines standard and aphasic speech at varying ratios, enabling robust generalization.
Score: 4.70623940988391
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper proposes AS-ASR, a lightweight aphasia-specific speech recognition framework based on Whisper-tiny, tailored for low-resource deployment on edge devices. Our approach introduces a hybrid training strategy that systematically combines standard and aphasic speech at varying ratios, enabling robust generalization, and a GPT-4-based reference enhancement method that refines noisy aphasic transcripts, improving supervision quality. We conduct extensive experiments across multiple data mixing configurations and evaluation settings. Results show that our fine-tuned model significantly outperforms the zero-shot baseline, reducing WER on aphasic speech by over 30% while preserving performance on standard speech. The proposed framework offers a scalable, efficient solution for real-world disordered speech recognition.

Related papers

$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)<n>MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.<n>To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model [14.795953417531907]
We propose a unified multichannel far-field speech recognition system that combines the neural beamforming and transformer-based Listen, Spell, Attend (LAS) speech recognition system. The proposed method achieve 19.26% improvement when compared with a strong baseline.
arXiv Detail & Related papers (2024-01-05T07:11:13Z)
Advancing Test-Time Adaptation in Wild Acoustic Test Settings [26.05732574338255]
Speech signals follow short-term consistency, requiring specialized adaptation strategies. We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. Our approach outperforms existing baselines under various wild acoustic test settings.
arXiv Detail & Related papers (2023-10-14T06:22:08Z)
DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE) We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods. We devise a two-stage training strategy to emulate the inference process during model training.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction. Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols. By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. The proposed method achieves the relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
Bidirectional Representations from Transformers (BERT) model was proposed and has achieved record-breaking success on many natural language processing tasks. We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition. We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.