Exploring State-Space-Model based Language Model in Music Generation
- URL: http://arxiv.org/abs/2507.06674v1
- Date: Wed, 09 Jul 2025 09:05:18 GMT
- Title: Exploring State-Space-Model based Language Model in Music Generation
- Authors: Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
- Abstract summary: We explore the potential of Mamba-based architectures for text-to-music generation. We adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth.
- Score: 12.697065688262521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. We put audio examples on Github.
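To make the modeling setup above concrete, the following is a minimal sketch of decoder-style language modeling over a single RVQ codebook, with the self-attention mixer replaced by an input-gated, linear-time causal recurrence. This is a simplified stand-in, not the paper's SiMBA or an official Mamba implementation; the module names, the 1024-entry vocabulary, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch only: single-codebook token decoding with a diagonal,
# input-gated linear recurrence instead of self-attention. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedScanBlock(nn.Module):
    """Per-step gate a_t in (0,1) drives h_t = a_t * h_{t-1} + (1 - a_t) * u_t,
    scanned causally over time with cost linear in sequence length."""
    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, time, dim)
        u = self.in_proj(x)
        a = torch.sigmoid(self.gate_proj(x))    # input-dependent decay gate
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):              # causal scan (decoder-style)
            h = a[:, t] * h + (1.0 - a[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))

class SingleCodebookDecoder(nn.Module):
    """Predicts the next token id of a single RVQ codebook from past tokens."""
    def __init__(self, vocab: int = 1024, dim: int = 256, depth: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(GatedScanBlock(dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                  # tokens: (batch, time) int64
        h = self.embed(tokens)
        for blk in self.blocks:
            h = h + blk(self.norm(h))           # pre-norm residual mixing
        return self.head(self.norm(h))          # next-token logits

# Teacher-forced next-token loss on dummy codebook indices.
tokens = torch.randint(0, 1024, (2, 128))
logits = SingleCodebookDecoder()(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
```

Text conditioning is omitted in this sketch; in a full text-to-music decoder the token stream would additionally be conditioned on text embeddings, for example via cross-attention or prefix conditioning.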
Related papers
- MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation [21.20366935690067]
We propose MambaVesselNet++, a hybrid CNN-Mamba framework for medical image segmentation. MambaVesselNet++ comprises a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder).
arXiv Detail & Related papers (2025-07-26T12:32:59Z) - An Exploration of Mamba for Speech Self-Supervised Models [48.01992287080999]
We explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. HuBERT models enable fine-tuning on long-context ASR with significantly lower compute. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.
arXiv Detail & Related papers (2025-06-14T19:00:44Z) - Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for speech enhancement (SE) tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z) - UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z) - Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation [16.298890431384564]
We introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation, utilizing the Mamba architecture.
By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities.
Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks.
arXiv Detail & Related papers (2024-04-05T17:59:44Z) - SPMamba: State-space model is all you need in speech separation [20.168153319805665]
CNN-based speech separation models face local receptive field limitations and cannot effectively capture long-term dependencies.
We introduce an innovative speech separation method called SPMamba.
This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules.
arXiv Detail & Related papers (2024-04-02T16:04:31Z) - PointMamba: A Simple State Space Model for Point Cloud Analysis [65.59944745840866]
We propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks.
Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs.
arXiv Detail & Related papers (2024-02-16T14:56:13Z) - MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
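For reference, the discretized state-space recurrence that Mamba-style layers (including the decoder sketched above) build on can be written, in simplified form, as

$$
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t ,
$$

where, in the selective variant, $\bar{B}$, $C$, and the discretization step $\Delta$ are computed from the input $x_t$; this gives content-dependent gating while keeping a linear-time scan over the sequence.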
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.