Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
- URL: http://arxiv.org/abs/2508.14556v1
- Date: Wed, 20 Aug 2025 09:19:11 GMT
- Title: Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
- Authors: Euiyeon Kim, Yong-Hoon Choi
- Abstract summary: We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
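The band-splitting and dual-path idea from the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the band boundaries, the toy spectrogram, and the `smooth` placeholder (standing in for a Mamba2 sequence block) are all assumptions made for brevity.

```python
# Minimal sketch of band-splitting + dual-path processing. The band edges,
# toy spectrogram, and the smoothing "sequence model" are illustrative
# placeholders, not the paper's actual configuration.

def split_bands(spec, band_edges):
    """Split a (freq x time) spectrogram into sub-bands along frequency."""
    return [spec[lo:hi] for lo, hi in zip(band_edges, band_edges[1:])]

def smooth(seq):
    """Placeholder for a sequence model (e.g., a Mamba2 block):
    here, a simple moving average over the sequence."""
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - 1), min(len(seq), i + 2)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def dual_path(bands):
    """Alternate modeling along time (within each band) and across bands
    (at each time step), the core of dual-path architectures."""
    # Intra-band pass: model each band's magnitude envelope over time.
    envelopes = [smooth([sum(frame) for frame in zip(*band)]) for band in bands]
    # Inter-band pass: model across bands at each time step.
    T = len(envelopes[0])
    across = [smooth([env[t] for env in envelopes]) for t in range(T)]
    # Transpose back to (band x time).
    return [[across[t][b] for t in range(T)] for b in range(len(envelopes))]

# Toy spectrogram: 4 frequency bins x 3 time frames.
spec = [[1.0, 2.0, 3.0],
        [1.0, 2.0, 3.0],
        [5.0, 5.0, 5.0],
        [5.0, 5.0, 5.0]]
bands = split_bands(spec, [0, 2, 4])   # two bands of two bins each
features = dual_path(bands)            # (2 bands x 3 time frames)
```

The design point the sketch captures is that each pass sees a short sequence (time within a band, or bands at one time step), which is what makes long high-resolution inputs tractable.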
Related papers
- DiM-TS: Bridge the Gap between Selective State Space Models and Time Series for Generative Modeling [11.836475971106125]
Time series data plays a pivotal role in a wide variety of fields but faces challenges related to privacy concerns. We propose Lag Fusion Mamba and Permutation Scanning Mamba, which enhance the model's ability to discern significant patterns during the denoising process. We also introduce Diffusion Mamba for Time Series (DiM-TS), a high-quality time series generation model that better preserves the temporal periodicity and inter-channel correlations.
arXiv Detail & Related papers (2025-11-23T06:48:03Z)
- High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z)
- ESTM: An Enhanced Dual-Branch Spectral-Temporal Mamba for Anomalous Sound Detection [39.234515088121086]
We propose a novel framework, ESTM, which is based on a dual-path Mamba architecture with time-frequency decoupled modeling. ESTM extracts rich feature representations from different time segments and frequency bands by fusing enhanced Mel spectrograms and raw audio features. Our experiments demonstrate that ESTM improves anomalous detection performance on the DCASE 2020 Task 2 dataset.
arXiv Detail & Related papers (2025-09-02T16:23:49Z)
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
- Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution [4.495657539150699]
Speech Super-Resolution (SSR) is the task of enhancing low-resolution speech signals by restoring missing high-frequency components. Conventional approaches typically reconstruct log-mel features, followed by a vocoder that generates high-resolution speech in the waveform domain. We propose a method, referred to as Wave-U-Mamba, that directly performs SSR in the time domain.
arXiv Detail & Related papers (2024-09-14T06:52:00Z)
- ReMamba: Equip Mamba with Effective Long-Sequence Modeling [50.530839868893786]
We propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process.
arXiv Detail & Related papers (2024-08-28T02:47:27Z)
- SPMamba: State-space model is all you need in speech separation [20.168153319805665]
CNN-based speech separation models face local receptive field limitations and cannot effectively capture long-range temporal dependencies.
We introduce an innovative speech separation method called SPMamba.
This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules.
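The BLSTM-to-bidirectional-Mamba swap described above amounts to running a causal sequence module over the input in both directions and combining the two passes. A hedged sketch, where the `ema` recurrence is an assumed stand-in for a real Mamba block:

```python
# Sketch of making a causal sequence module bidirectional, the way SPMamba
# swaps TF-GridNet's BLSTM for bidirectional Mamba modules. The "ema"
# recurrence below is a toy stand-in for an SSM block, not real Mamba code.

def ema(seq, alpha=0.5):
    """Causal stand-in for an SSM block: exponential moving average,
    where each output depends only on past inputs."""
    state, out = 0.0, []
    for x in seq:
        state = alpha * x + (1 - alpha) * state
        out.append(state)
    return out

def bidirectional(module, seq):
    """Run the module forward and over the reversed sequence, then sum,
    so every step sees both past and future context."""
    fwd = module(seq)
    bwd = list(reversed(module(list(reversed(seq)))))
    return [f + b for f, b in zip(fwd, bwd)]

y = bidirectional(ema, [1.0, 0.0, 0.0, 1.0])
```

Because the input here is palindromic, the combined output is symmetric, which a purely causal pass would never produce; that extra future context is what the bidirectional wrapper buys.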
arXiv Detail & Related papers (2024-04-02T16:04:31Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- DPATD: Dual-Phase Audio Transformer for Denoising [25.097894984130733]
We propose a dual-phase audio transformer for denoising (DPATD), a novel model to organize transformer layers in a deep structure to learn clean audio sequences for denoising.
Our memory-compressed explainable attention is efficient and converges faster compared to the frequently used self-attention module.
arXiv Detail & Related papers (2023-10-30T14:44:59Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the upper bound can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- End-To-End Audiovisual Feature Fusion for Active Speaker Detection [7.631698269792165]
This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients features extracted from the audio waveform.
Our best-performing model attained 88.929% accuracy, nearly matching state-of-the-art detection results.
arXiv Detail & Related papers (2022-07-27T10:25:59Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
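The encoder-decoder-with-skip-connections structure above can be illustrated with a minimal sketch. The real model uses learned causal convolutions; here the layers are fixed down/upsampling operations, an assumption made purely to show how skip connections restore detail lost in the bottleneck.

```python
# Illustrative sketch of an encoder-decoder with skip connections over a
# raw waveform. The fixed averaging/upsampling layers are stand-ins for
# the model's learned causal convolutions (an assumption for brevity).

def encode(x):
    """Downsample by 2 (averaging adjacent samples); keep x as a skip."""
    return [(a + b) / 2 for a, b in zip(x[::2], x[1::2])], x

def decode(h, skip):
    """Upsample by 2 (repeat each sample) and add the skip connection,
    which restores fine detail discarded by the bottleneck."""
    up = [v for v in h for _ in range(2)]
    return [u + s for u, s in zip(up, skip)]

x = [0.0, 1.0, 0.0, 1.0]     # toy waveform
h, skip = encode(x)          # bottleneck: [0.5, 0.5]
y = decode(h, skip)          # same length as the input
```

Without the skip term, the output would be the blurred `up` signal alone; adding the skip reinjects the high-frequency alternation that the downsampling averaged away.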
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.