SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
- URL: http://arxiv.org/abs/2505.03273v2
- Date: Mon, 26 May 2025 07:01:19 GMT
- Title: SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
- Authors: Zhaoxi Mu, Xinyu Yang, Gang Wang
- Abstract summary: We introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
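The four-component design described in the abstract suggests a simple pipeline shape. Below is a minimal sketch of that flow; the module interfaces are assumptions inferred from the abstract alone, not the authors' released code.

```python
# Minimal sketch of the four-stage SepALM flow described in the abstract.
# All module interfaces here are assumptions, not the authors' API.
import torch

class SepALMPipeline:
    def __init__(self, separator, corrector, synthesizer, aligner):
        self.separator = separator      # preliminary separation of the mixture
        self.corrector = corrector      # ALM: corrects the estimate via the text domain
        self.synthesizer = synthesizer  # re-synthesizes speech from corrected text
        self.aligner = aligner          # aligns re-synthesized speech to the estimate

    @torch.no_grad()
    def separate(self, mixture: torch.Tensor) -> list[torch.Tensor]:
        outputs = []
        for est in self.separator(mixture):       # one stream per speaker
            text = self.corrector(est)            # end-to-end ALM error correction
            speech = self.synthesizer(text, est)  # est supplies speaker/prosody cues
            outputs.append(self.aligner(speech, est))
        return outputs
```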
Related papers
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline [29.85417427778784]
SoloSpeech is a cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. It achieves new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks.
arXiv Detail & Related papers (2025-05-25T21:00:48Z) - It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output, but GER conventionally sees only the text hypotheses, not the audio.
In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription, through a novel late-fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
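A rough sketch of this token-level late-fusion idea: lean on the acoustic distribution more when the language model is uncertain. The entropy-based weighting below is illustrative, not the paper's exact calibration.

```python
# Illustrative token-level late fusion in the spirit of UADF: the LLM's
# normalized entropy decides how much weight the acoustic (ASR) distribution
# gets. The weighting rule is an assumption, not the paper's exact scheme.
import math
import torch
import torch.nn.functional as F

def fuse_step(llm_logits: torch.Tensor, asr_logits: torch.Tensor) -> torch.Tensor:
    p_llm = F.softmax(llm_logits, dim=-1)
    p_asr = F.softmax(asr_logits, dim=-1)
    # Token-level uncertainty of the LLM, normalized to [0, 1].
    entropy = -(p_llm * p_llm.clamp_min(1e-9).log()).sum(dim=-1)
    lam = (entropy / math.log(p_llm.size(-1))).unsqueeze(-1)
    fused = (1.0 - lam) * p_llm + lam * p_asr   # dynamic fusion weight
    return fused.argmax(dim=-1)                 # fused next-token decision
```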
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate.
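GER's interface is simple to picture: the LLM reads the recognizer's N-best list and emits one corrected transcription. A toy prompt builder, whose instruction wording is my assumption rather than the paper's template:

```python
# Toy prompt construction for generative error correction (GER): the LLM
# maps an N-best hypothesis list to one corrected transcription. The
# instruction wording is illustrative, not the paper's exact template.
def build_ger_prompt(nbest: list[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best hypotheses from a speech recognizer for one "
        "noisy utterance. Report the most likely true transcription.\n"
        f"{hyps}\n"
        "Transcription:"
    )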
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE).
We apply the same forward process as diffusion models and utilize a reconstruction loss similar to that of discriminative methods.
We devise a two-stage training strategy to emulate the inference process during model training.
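That recipe compresses into a short training step, assuming a standard variance-preserving forward process and an L1 reconstruction loss; the paper's actual schedules, conditioning, and two-stage details are simplified away here.

```python
# Sketch of the DDTSE training idea: corrupt the target with the usual
# diffusion forward process, but train the network to predict the clean
# target directly with a reconstruction loss (discriminative-style), not
# a score-matching objective. Schedules and conditioning are simplified.
import torch
import torch.nn.functional as F

def ddtse_loss(model, clean, mixture, enroll, alphas_cumprod):
    t = torch.randint(0, alphas_cumprod.numel(), (clean.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1)               # per-sample noise level
    noisy = a_bar.sqrt() * clean + (1 - a_bar).sqrt() * torch.randn_like(clean)
    est = model(noisy, t, mixture, enroll)              # condition on mixture + enrollment
    return F.l1_loss(est, clean)                        # reconstruct the clean target
```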
arXiv Detail & Related papers (2023-09-25T04:58:38Z) - Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription [31.774032625780414]
TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. We extend the mixture encoder from a static two-speaker scenario to a natural meeting context. Experiments yield new state-of-the-art performance on LibriCSS using a single microphone.
arXiv Detail & Related papers (2023-09-15T14:57:28Z) - On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
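The pattern summarized above is easy to picture: acoustic features are projected into the LLM's embedding space and prepended to the text embeddings, so a single decoder attends over both. A minimal sketch with placeholder modules, not the Speech-LLaMA release:

```python
# Minimal decoder-only integration sketch: project audio features into the
# LLM embedding space and prepend them to the text token embeddings.
# `audio_encoder` and `llm` are placeholder modules, not the paper's code.
import torch
import torch.nn as nn

class SpeechPrefixLM(nn.Module):
    def __init__(self, audio_encoder, llm, d_audio, d_model):
        super().__init__()
        self.audio_encoder = audio_encoder       # waveform -> (B, T_a, d_audio)
        self.proj = nn.Linear(d_audio, d_model)  # into LLM embedding space
        self.llm = llm                           # decoder-only LM

    def forward(self, audio, text_ids):
        prefix = self.proj(self.audio_encoder(audio))   # (B, T_a, d_model)
        text_emb = self.llm.embed_tokens(text_ids)      # (B, T_t, d_model)
        return self.llm(inputs_embeds=torch.cat([prefix, text_emb], dim=1))
```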
arXiv Detail & Related papers (2023-07-08T06:47:58Z) - Mixture Encoder for Joint Speech Separation and Recognition [15.13598115379631]
Multi-speaker automatic speech recognition is crucial for many real-world applications.
Existing approaches can be divided into modular and end-to-end methods.
End-to-end models process overlapped speech directly in a single, powerful neural network.
arXiv Detail & Related papers (2023-06-21T11:01:31Z) - Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
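A toy rendering of that route, where `unit_predictor` and `unit_vocoder` are hypothetical stand-ins for the paper's recognition and synthesis models:

```python
# Toy sketch of discretize-and-re-synthesize separation: predict a discrete
# unit sequence per speaker from the mixture, then vocode each sequence back
# into a waveform. Both callables are hypothetical stand-ins.
def separate_by_resynthesis(mixture, unit_predictor, unit_vocoder, n_speakers=2):
    waveforms = []
    for spk in range(n_speakers):
        units = unit_predictor(mixture, speaker_index=spk)  # discrete symbol sequence
        waveforms.append(unit_vocoder(units))               # re-synthesized target speech
    return waveforms
```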
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition [1.2914521751805657]
We propose a novel approach that collectively exploits the contextualized representation and the phonetic information between an error and its replacement candidates to reduce the error rate of Chinese ASR.
Our experimental results on real-world speech recognition show that the proposed method achieves an evidently lower error rate than the baseline model.
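The combination of signals can be pictured as a joint score over replacement candidates. A hedged sketch, where `lm_score`, `phonetic_similarity`, and the 0.5 mixing weight are all assumptions:

```python
# Hedged sketch of joint semantic + phonetic candidate ranking: blend a
# contextual plausibility score with phonetic closeness to the (possibly
# wrong) recognized token. Helper functions and the weight are assumptions.
def pick_replacement(context, error_token, candidates,
                     lm_score, phonetic_similarity, w=0.5):
    def joint_score(cand):
        return (w * lm_score(context, cand)
                + (1.0 - w) * phonetic_similarity(error_token, cand))
    return max(candidates, key=joint_score)
```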
arXiv Detail & Related papers (2021-11-16T11:55:27Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
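A generative LM over such units is compact to sketch; the dimensions below are arbitrary and the model is only a toy stand-in for the paper's architecture.

```python
# Toy LSTM language model over sub-word linguistic units (phoneme or
# syllable IDs). Dimensions are arbitrary; this is not the paper's model.
import torch.nn as nn

class UnitLSTMLM(nn.Module):
    def __init__(self, n_units: int, d_emb: int = 128, d_hid: int = 256):
        super().__init__()
        self.emb = nn.Embedding(n_units, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.head = nn.Linear(d_hid, n_units)   # next-unit logits

    def forward(self, unit_ids):                # (B, T) unit IDs
        hidden, _ = self.lstm(self.emb(unit_ids))
        return self.head(hidden)                # (B, T, n_units)
```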
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation that improves segmentation accuracy in unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)