xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement
- URL: http://arxiv.org/abs/2501.06146v1
- Date: Fri, 10 Jan 2025 18:10:06 GMT
- Title: xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement
- Authors: Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
- Abstract summary: This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system.
Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems on the VoiceBank+DEMAND dataset.
- Score: 19.76560732937885
- Abstract: While attention-based architectures, such as Conformers, excel in speech enhancement, they face challenges such as scalability with respect to input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM, and notably even LSTM, can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through ablation studies, we identify key architectural design choices, such as exponential gating and bidirectionality, that contribute to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems on the VoiceBank+DEMAND dataset.
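As a rough illustration of the design space the abstract describes, the sketch below shows a bidirectional recurrent mask estimator whose cost grows linearly with the number of spectrogram frames. It is a minimal stand-in, not the authors' code: a true xLSTM-SENet block would replace the plain LSTM with mLSTM/sLSTM cells featuring exponential gating, and all layer sizes here are assumptions.

```python
# Minimal sketch (not the authors' code): a bidirectional recurrent
# enhancement block in the spirit of xLSTM-SENet. Shapes and layer
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BiRecurrentMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Process the spectrogram frame by frame; cost grows linearly
        # with the number of frames, unlike self-attention's quadratic cost.
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_mag):           # (batch, frames, n_freq)
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.proj(h))  # bounded spectral mask
        return mask * noisy_mag             # enhanced magnitude

x = torch.randn(1, 100, 257).abs()          # dummy noisy magnitudes
enhanced = BiRecurrentMaskEstimator()(x)
print(enhanced.shape)                        # torch.Size([1, 100, 257])
```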
Related papers
- Large Generative Model-assisted Talking-face Semantic Communication System [55.42631520122753]
This study introduces a Large Generative Model-assisted Talking-face Semantic Communication (LGM-TSC) system.
A Generative Semantic Extractor (GSE) at the transmitter converts semantically sparse talking-face videos into texts with high information density.
A private Knowledge Base (KB) based on a Large Language Model (LLM) performs semantic disambiguation and correction.
A Generative Semantic Reconstructor (GSR) uses the BERT-VITS2 and SadTalker models to transform text back into a high-QoE talking-face video.
arXiv Detail & Related papers (2024-11-06T12:45:46Z)
- xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart [13.812935743270517]
We propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation.
xLSTM was recently proposed as the successor to Long Short-Term Memory (LSTM) networks.
Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks.
arXiv Detail & Related papers (2024-07-01T17:59:54Z)
- Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images [1.5954224931801726]
This study is the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images.
Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformers-based and Vision-Mamba-based models in most comparative tests.
arXiv Detail & Related papers (2024-06-20T08:01:28Z)
- xLSTM: Extended Long Short-Term Memory [26.607656211983155]
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM).
We introduce exponential gating with appropriate normalization and stabilization techniques.
We modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule.
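A minimal sketch of one mLSTM step, following the recurrences described in the paper: exponential input and forget gates stabilized in log space, a matrix memory, and a covariance (outer-product) update. The weight shapes and initialization below are illustrative assumptions, not the reference implementation.

```python
# Hedged sketch of a single mLSTM step: exponential gating with
# log-space stabilization, matrix memory C, covariance update.
import numpy as np

def mlstm_step(x, C, n, m, W):
    d = x.shape[0]
    q = W["q"] @ x
    k = W["k"] @ x / np.sqrt(d)
    v = W["v"] @ x
    i_tilde = W["i"] @ x                    # pre-activation input gate (scalar)
    f_tilde = W["f"] @ x                    # pre-activation forget gate (scalar)
    o = 1 / (1 + np.exp(-(W["o"] @ x)))     # sigmoid output gate (vector)

    # Exponential gating, stabilized so exp() never overflows.
    m_new = max(f_tilde + m, i_tilde)
    i_gate = np.exp(i_tilde - m_new)
    f_gate = np.exp(f_tilde + m - m_new)

    C = f_gate * C + i_gate * np.outer(v, k)    # covariance update
    n = f_gate * n + i_gate * k                 # normalizer state
    h_tilde = C @ q / max(abs(n @ q), 1.0)      # normalized readout
    return o * h_tilde, C, n, m_new

rng = np.random.default_rng(0)
d = 8
W = {g: rng.normal(scale=0.1, size=(d, d)) for g in ("q", "k", "v", "o")}
W["i"], W["f"] = rng.normal(size=d), rng.normal(size=d)
h, C, n, m = None, np.zeros((d, d)), np.zeros(d), 0.0
for t in range(5):
    h, C, n, m = mlstm_step(rng.normal(size=d), C, n, m, W)
print(h.shape)   # (8,)
```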
arXiv Detail & Related papers (2024-05-07T17:50:21Z)
- An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in speech processing using speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
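A hedged sketch of that composition: a frozen off-the-shelf speech encoder and a frozen LLM joined by the single trainable linear projector. The stand-in encoder and LLM bodies and all dimensions are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: frozen speech encoder + frozen LLM, with the
# linear projector as the only trainable component.
import torch
import torch.nn as nn

class SpeechLLM(nn.Module):
    def __init__(self, speech_encoder, llm, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.encoder = speech_encoder.eval()   # frozen, off-the-shelf
        self.llm = llm.eval()                  # frozen, off-the-shelf
        for p in list(self.encoder.parameters()) + list(self.llm.parameters()):
            p.requires_grad = False
        self.projector = nn.Linear(enc_dim, llm_dim)  # only trainable part

    def forward(self, speech, prompt_embeds):
        feats = self.encoder(speech)             # (B, T, enc_dim)
        speech_embeds = self.projector(feats)    # (B, T, llm_dim)
        # Prepend projected speech features to the text prompt embeddings.
        inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
        return self.llm(inputs)

enc = nn.Sequential(nn.Linear(80, 1024))     # stand-in encoder over fbank frames
llm = nn.Sequential(nn.Linear(4096, 4096))   # stand-in LLM body
out = SpeechLLM(enc, llm)(torch.randn(1, 50, 80), torch.randn(1, 10, 4096))
print(out.shape)                              # torch.Size([1, 60, 4096])
```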
arXiv Detail & Related papers (2024-02-13T23:25:04Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
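A small sketch of the serialization idea behind t-SOT: tokens from overlapping speakers are ordered by emission time, with a channel-change token inserted whenever adjacent tokens come from different virtual output channels. The token timings and two-speaker setup below are made-up illustrative data.

```python
# Hedged sketch of token-level serialized output: sort tokens by
# emission time across channels, marking channel switches with <cc>.
def serialize_tsot(token_streams, cc="<cc>"):
    # token_streams: one list of (time, token) pairs per virtual channel
    tagged = sorted(
        (t, ch, tok)
        for ch, stream in enumerate(token_streams)
        for t, tok in stream
    )
    out, prev_ch = [], None
    for _, ch, tok in tagged:
        if prev_ch is not None and ch != prev_ch:
            out.append(cc)        # channel change between adjacent tokens
        out.append(tok)
        prev_ch = ch
    return out

spk0 = [(0.0, "hello"), (0.4, "how"), (0.8, "are"), (1.2, "you")]
spk1 = [(0.5, "good"), (0.9, "morning")]
print(serialize_tsot([spk0, spk1]))
# ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']
```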
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition [27.639919625398]
LSTM language models (LSTM-LMs) have proven powerful, yielding significant performance improvements over count-based n-gram LMs in modern speech recognition systems.
Recent work shows that it is feasible and computationally affordable to adopt LSTM-LMs in first-pass decoding within a dynamic (or tree-based) decoder framework.
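The sketch below illustrates why first-pass use is affordable: each beam hypothesis caches its own LM state, so scoring an extension is a single incremental step rather than rescoring the whole prefix. The toy LM interface is an assumption, not the paper's decoder.

```python
# Hedged sketch: per-hypothesis LM state caching during beam search.
import math

class ToyLSTMLM:
    """Stand-in for a trained LSTM-LM; returns (logprob, new_state)."""
    def score(self, state, word):
        new_state = (state or ()) + (word,)   # pretend hidden state
        return -math.log(20.0), new_state     # uniform over a toy vocab

def extend_beam(lm, beam, candidates, width=3):
    # beam: list of (words, lm_state, total_logprob)
    scored = []
    for words, state, lp in beam:
        for w in candidates:
            wlp, new_state = lm.score(state, w)   # one incremental step
            scored.append((words + [w], new_state, lp + wlp))
    scored.sort(key=lambda h: -h[2])
    return scored[:width]

beam = [([], None, 0.0)]
for _ in range(3):
    beam = extend_beam(ToyLSTMLM(), beam, ["the", "cat", "sat"])
print(beam[0][0], round(beam[0][2], 2))
```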
arXiv Detail & Related papers (2020-10-21T23:40:26Z)
- Future Vector Enhanced LSTM Language Model for LVCSR [67.03726018635174]
This paper proposes a novel enhanced long short-term memory (LSTM) LM using the future vector.
Experiments show that the proposed LSTM LM achieves better BLEU scores for long-term sequence prediction.
Rescoring with both the new and conventional LSTM LMs achieves a large improvement in word error rate.
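A toy sketch of that rescoring step, re-ranking n-best hypotheses with an interpolation of the conventional and future-vector LM scores; the hypotheses, scores, and interpolation weight are made up for illustration.

```python
# Hedged sketch: n-best rescoring with two interpolated LM scores.
def rescore(nbest, lam=0.5):
    # nbest: list of (hypothesis, am_score, lm_conventional, lm_future)
    return max(
        nbest,
        key=lambda h: h[1] + lam * h[2] + (1 - lam) * h[3],
    )

nbest = [
    ("the cat sat on the mat", -12.1, -8.3, -7.9),
    ("the cat sat on a mat",   -12.4, -8.0, -7.2),
]
print(rescore(nbest)[0])   # 'the cat sat on a mat'
```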
arXiv Detail & Related papers (2020-07-31T08:38:56Z)
- Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
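A hedged sketch of the multi-view idea: several frequency-LSTM stacks scan the same spectrogram frame with different band sizes and strides ("views"), and their outputs are concatenated. The band settings and dimensions are illustrative assumptions.

```python
# Hedged sketch: multiple FLSTM views over one frame, outputs combined.
import torch
import torch.nn as nn

class FLSTMView(nn.Module):
    def __init__(self, band, stride, hidden=32):
        super().__init__()
        self.band, self.stride = band, stride
        self.lstm = nn.LSTM(band, hidden, batch_first=True)

    def forward(self, frame):                             # (batch, n_freq)
        bands = frame.unfold(-1, self.band, self.stride)  # (B, n_bands, band)
        out, _ = self.lstm(bands)                         # scan along frequency
        return out[:, -1]                                 # (B, hidden) summary

views = nn.ModuleList([FLSTMView(24, 12), FLSTMView(32, 16), FLSTMView(48, 24)])
frame = torch.randn(4, 256)                               # one spectral frame
multi_view = torch.cat([v(frame) for v in views], dim=-1)
print(multi_view.shape)                                   # torch.Size([4, 96])
```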
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
- Depth-Adaptive Graph Recurrent Network for Text Classification [71.20237659479703]
Sentence-State LSTM (S-LSTM) is a powerful and highly efficient graph recurrent network.
We propose a depth-adaptive mechanism for the S-LSTM, which allows the model to learn how many computational steps to conduct for different words as required.
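A sketch of one way such depth adaptivity can work, using ACT-style halting (an assumption, not necessarily the paper's exact mechanism): each word accumulates a halting probability across recurrent steps and stops updating once it crosses a threshold, so easy words get fewer computation steps than hard ones.

```python
# Hedged sketch: per-word adaptive computation depth via halting.
import torch
import torch.nn as nn

class DepthAdaptiveStep(nn.Module):
    def __init__(self, dim=64, max_steps=6, threshold=0.99):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # stand-in recurrent update
        self.halt = nn.Linear(dim, 1)
        self.max_steps, self.threshold = max_steps, threshold

    def forward(self, words):              # (n_words, dim)
        h = words
        halted = torch.zeros(len(words))
        for _ in range(self.max_steps):
            p = torch.sigmoid(self.halt(h)).squeeze(-1)
            halted = halted + (1 - (halted > self.threshold).float()) * p
            active = (halted <= self.threshold).float().unsqueeze(-1)
            h = active * self.cell(h, h) + (1 - active) * h  # update active words only
        return h

h = DepthAdaptiveStep()(torch.randn(5, 64))
print(h.shape)   # torch.Size([5, 64])
```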
arXiv Detail & Related papers (2020-02-29T03:09:55Z)