DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
- URL: http://arxiv.org/abs/2511.20224v1
- Date: Tue, 25 Nov 2025 11:53:57 GMT
- Title: DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
- Authors: Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen, Junyu Dai, Tao Jiang
- Abstract summary: Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music. It targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems.
- Score: 3.5346639529821435
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
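The dual-codebook, hard-routing idea from the abstract can be sketched in a few lines: each frame is quantized only against the codebook of its source track (vocal or accompaniment), rather than a shared codebook. This is a minimal NumPy illustration of that routing step; the codebook contents, feature dimension, and routing labels are invented for the example and do not reproduce Duo-Tok's actual SimVQ codebooks or frozen SSL features.

```python
import numpy as np

def quantize_dual_track(frames, track_ids, codebook_vocal, codebook_acc):
    """Nearest-neighbor quantization with hard routing: each frame is
    matched only against the codebook of its own source track."""
    tokens = []
    for x, track in zip(frames, track_ids):
        book = codebook_vocal if track == "vocal" else codebook_acc
        # Euclidean nearest neighbor within the routed codebook only
        idx = int(np.argmin(np.linalg.norm(book - x, axis=1)))
        tokens.append((track, idx))
    return tokens

# Toy setup: two 8-entry codebooks over 4-dim features
rng = np.random.default_rng(0)
cb_vocal = rng.normal(size=(8, 4))
cb_acc = rng.normal(size=(8, 4))
frames = rng.normal(size=(3, 4))
print(quantize_dual_track(frames, ["vocal", "acc", "vocal"], cb_vocal, cb_acc))
```

Because routing is hard (by source label) rather than learned, a downstream LM always knows which vocabulary a token came from, which is one way such a design can keep per-track vocabularies small and LM-friendly.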
Related papers
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design preserves multimodal understanding ability with negligible loss of semantic information while yielding discrete tokens for generation. VQRAE presents competitive performance on several benchmarks of visual understanding, generation, and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio [7.093237513313511]
CoDiCodec is a novel audio autoencoder that overcomes limitations of prior codecs by efficiently encoding global features via summary embeddings. It produces both compressed continuous embeddings at 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model. This work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
arXiv Detail & Related papers (2025-09-11T20:31:18Z) - Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) bakes in redundancy, producing an encoding that is robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z) - SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z) - LeVo: High-Quality Song Generation with Multi-Preference Alignment [47.965028296133426]
We introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeVo is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment, and dual-track tokens, which encode vocals and accompaniment separately. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types.
arXiv Detail & Related papers (2025-06-09T07:57:24Z) - FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
arXiv Detail & Related papers (2025-05-02T13:30:19Z) - Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding [0.0]
We introduce Music2Latent2, a novel audio autoencoder. Unlike conventional methods that encode local audio features into ordered sequences, Music2Latent2 compresses audio signals into sets of summary embeddings. To handle arbitrary audio lengths, it employs an autoregressive consistency model trained on two consecutive audio chunks with causal masking.
arXiv Detail & Related papers (2025-01-29T11:34:19Z) - WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z) - Music2Latent: Consistency Autoencoders for Latent Audio Compression [0.0]
We introduce Music2Latent, an audio autoencoder that overcomes limitations by leveraging consistency models.
Music2Latent encodes samples into a compressed continuous latent space in a single end-to-end training process.
We demonstrate that Music2Latent outperforms existing continuous audio autoencoders in sound quality and reconstruction accuracy.
arXiv Detail & Related papers (2024-08-12T21:25:19Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.