RescueSpeech: A German Corpus for Speech Recognition in Search and
Rescue Domain
- URL: http://arxiv.org/abs/2306.04054v3
- Date: Mon, 25 Sep 2023 08:00:05 GMT
- Title: RescueSpeech: A German Corpus for Speech Recognition in Search and
Rescue Domain
- Authors: Sangeet Sagar, Mirco Ravanelli, Bernd Kiefer, Ivana Kruijff-Korbayova,
Josef van Genabith
- Abstract summary: Speech recognition is still difficult in noisy and reverberant environments.
We have created and made publicly available a German speech dataset called RescueSpeech.
Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.
- Score: 20.07933161385449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent advancements in speech recognition, there are still
difficulties in accurately transcribing conversational and emotional speech in
noisy and reverberant acoustic environments. This poses a particular challenge
in the search and rescue (SAR) domain, where transcribing conversations among
rescue team members is crucial to support real-time decision-making. The
scarcity of speech data and associated background noise in SAR scenarios make
it difficult to deploy robust speech recognition systems. To address this
issue, we have created and made publicly available a German speech dataset
called RescueSpeech. This dataset includes real speech recordings from
simulated rescue exercises. Additionally, we have released competitive training
recipes and pre-trained models. Our study highlights that the performance
attained by state-of-the-art methods in this challenging scenario is still far
from reaching an acceptable level.
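
As a concrete illustration of the evaluation setting, the sketch below scores an off-the-shelf German-capable ASR model on a single noisy recording with word error rate (WER). The audio path and reference transcript are placeholders, and the generic Whisper checkpoint stands in for the released RescueSpeech recipes and models.

    import torchaudio
    import jiwer
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # Off-the-shelf multilingual model; the paper's own recipes/models differ.
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Placeholder input: a noisy German utterance resampled to Whisper's 16 kHz.
    wave, sr = torchaudio.load("sar_utterance.wav")
    wave = torchaudio.functional.resample(wave, sr, 16000).mean(dim=0)

    inputs = processor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
    prompt_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
    pred_ids = model.generate(inputs.input_features, forced_decoder_ids=prompt_ids)
    hypothesis = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    reference = "wir brauchen sofort verstärkung in sektor drei"  # placeholder
    print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")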
Related papers
- Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language [0.0]
This study presents a curated corpus of speech samples from native Akan speakers with speech impairment. The dataset comprises 50.01 hours of audio recordings across four classes of impaired speech: stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder.
arXiv Detail & Related papers (2026-02-05T07:44:13Z)
- End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering [33.675277272634666]
CLSR is an end-to-end contrastive language-speech retriever. It efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task.
arXiv Detail & Related papers (2025-11-12T12:49:30Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition [8.838919369202525]
Speech impairments resulting from congenital disorders present major challenges to automatic speech recognition systems. State-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning.
arXiv Detail & Related papers (2025-09-23T13:44:58Z)
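
To make the low-rank adaptation idea in the entry above concrete, here is a minimal PyTorch sketch of a frozen linear layer wrapped with a trainable low-rank update. The paper's variational/Bayesian variant would additionally place distributions over the low-rank factors; this deterministic version is only an assumed baseline, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen pretrained linear layer plus a trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)        # keep pretrained weights fixed
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
            self.scale = alpha / rank

        def forward(self, x):
            # base(x) + scale * x A^T B^T  ==  (W + scale * B A) x
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(512, 512), rank=8)
    out = layer(torch.randn(4, 512))           # only A and B receive gradients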
- Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech [0.562479170374811]
Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition systems. We propose a practical and lightweight pipeline to personalize ASR models, formalizing the selection of words and enriching a small speech-impaired dataset with semantic coherence. Our approach shows promising improvements in transcription quality, demonstrating the potential to reduce communication barriers for individuals with atypical speech patterns.
arXiv Detail & Related papers (2025-06-23T15:30:50Z)
- Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs [41.088390995105826]
Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs). However, LLMs are trained on text-only data, which makes adapting them to the speech modality with limited speech-to-speech data challenging. In this study, we propose scheduled interleaved speech-text training.
arXiv Detail & Related papers (2025-06-12T02:24:44Z)
- Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR [18.701864254184308]
We combine rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech.
We find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria.
arXiv Detail & Related papers (2025-01-17T15:39:21Z)
- Speech Retrieval-Augmented Generation without Automatic Speech Recognition [4.731446054087683]
SpeechRAG is a novel framework designed for open-question answering over spoken data.
Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter whose outputs are fed into a frozen large language model.
By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries.
arXiv Detail & Related papers (2024-12-21T06:16:04Z)
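
At inference time, the retrieval step described above amounts to nearest-neighbour search in the shared embedding space. A minimal sketch assuming pre-computed embeddings (the actual SpeechRAG encoders are not reproduced here):

    import numpy as np

    def retrieve_audio_passages(query_emb: np.ndarray, audio_embs: np.ndarray, k: int = 5):
        """Rank audio passages by cosine similarity to a text query embedding.
        This only works because training aligned both modalities into one space."""
        q = query_emb / np.linalg.norm(query_emb)
        a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
        scores = a @ q                      # cosine similarity per passage
        return np.argsort(-scores)[:k]

    # Toy usage with random stand-in embeddings.
    rng = np.random.default_rng(0)
    top = retrieve_audio_passages(rng.normal(size=256), rng.normal(size=(100, 256)))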
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
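
The heart of such contrastive pretraining is a symmetric InfoNCE objective that pulls paired speech and phoneme embeddings together. The sketch below is a generic CLIP-style, utterance-level version; CTAP's actual frame-level objective may differ:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired speech/phoneme embeddings.
        Matched pairs sit on the diagonal of the similarity matrix."""
        s = F.normalize(speech_emb, dim=-1)
        p = F.normalize(phoneme_emb, dim=-1)
        logits = s @ p.T / temperature               # (batch, batch) similarities
        targets = torch.arange(len(s), device=s.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))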
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
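
A common way to build a discrete speech tokenizer of the kind mentioned above is to quantize self-supervised frame features with k-means, yielding one pseudo-token per frame. A toy sketch with random stand-in features, not SpeechLM's actual tokenizers:

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for (num_frames, feat_dim) features from a speech encoder.
    frames = np.random.default_rng(0).normal(size=(1000, 256)).astype(np.float32)

    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
    tokens = kmeans.predict(frames)      # one discrete unit id per frame
    print(tokens[:20])                   # pseudo-token sequence for the utterance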
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
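
The run-time adaptation above can be caricatured as confidence-gated pseudo-labelling: only recognition results the model trusts are recycled as training targets. A schematic sketch in which the recognize callable and the threshold are assumptions, not the paper's implementation:

    def collect_adaptation_batch(noisy_utterances, recognize, threshold=0.9):
        """Keep (audio, transcript) pairs only where the recognizer is confident.
        `recognize(wav)` is assumed to return (text, confidence in [0, 1])."""
        batch = []
        for wav in noisy_utterances:
            text, confidence = recognize(wav)
            if confidence >= threshold:
                batch.append((wav, text))   # treated as ground truth downstream
        return batch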
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions.
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
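
A compression-ratio difficulty score is cheap to approximate with a general-purpose compressor, since high-entropy (noisy) audio compresses poorly. The toy sketch below illustrates the idea; the paper's exact definition and curriculum ordering may differ:

    import os
    import zlib

    def compression_ratio(audio_bytes: bytes) -> float:
        """Raw size over deflate-compressed size. High-entropy (noisy) audio
        compresses poorly, so a lower ratio suggests a harder example."""
        return len(audio_bytes) / len(zlib.compress(audio_bytes))

    # Toy demo: a constant signal compresses far better than random noise.
    quiet = bytes(16000)                 # 1 s of silence at 16 kHz, 8-bit
    noisy = os.urandom(16000)            # 1 s of pseudo-random "noise"
    print(compression_ratio(quiet))      # large ratio -> "easy"
    print(compression_ratio(noisy))      # close to 1.0 -> "hard"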
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Silent versus modal multi-speaker speech recognition from ultrasound and video [43.919073642794324]
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips.
We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech.
We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing.
arXiv Detail & Related papers (2021-02-27T21:34:48Z)
- Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech.
We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z)
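
CycleGAN-based conversion of the kind used above hinges on a cycle-consistency loss: converting to the target domain and back should reconstruct the input. A minimal PyTorch sketch of that term alone, with identity stand-ins for the generators and the adversarial losses omitted:

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(x_src, x_tgt, g_src2tgt, g_tgt2src):
        """L1 penalty for round-trip reconstruction in both directions."""
        loss_src = F.l1_loss(g_tgt2src(g_src2tgt(x_src)), x_src)
        loss_tgt = F.l1_loss(g_src2tgt(g_tgt2src(x_tgt)), x_tgt)
        return loss_src + loss_tgt

    # Toy usage on spectrogram-shaped tensors with identity "generators".
    ident = lambda x: x
    loss = cycle_consistency_loss(torch.randn(2, 80, 100),
                                  torch.randn(2, 80, 100), ident, ident)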
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.