DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
- URL: http://arxiv.org/abs/2511.20224v1
- Date: Tue, 25 Nov 2025 11:53:57 GMT
- Title: DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation
- Authors: Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen, Junyu Dai, Tao Jiang
- Abstract summary: Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music. It targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems.
- Score: 3.5346639529821435
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
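The dual-codebook, hard-routing idea from the abstract can be sketched in a few lines: each frame is quantized only against the codebook of its source track (vocal or accompaniment), rather than a shared codebook. This is a minimal NumPy illustration of that routing step; the codebook contents, feature dimension, and routing labels are invented for the example and do not reproduce Duo-Tok's actual SimVQ codebooks or frozen SSL features.

```python
import numpy as np

def quantize_dual_track(frames, track_ids, codebook_vocal, codebook_acc):
    """Nearest-neighbor quantization with hard routing: each frame is
    matched only against the codebook of its own source track."""
    tokens = []
    for x, track in zip(frames, track_ids):
        book = codebook_vocal if track == "vocal" else codebook_acc
        # Euclidean nearest neighbor within the routed codebook only
        idx = int(np.argmin(np.linalg.norm(book - x, axis=1)))
        tokens.append((track, idx))
    return tokens

# Toy setup: two 8-entry codebooks over 4-dim features
rng = np.random.default_rng(0)
cb_vocal = rng.normal(size=(8, 4))
cb_acc = rng.normal(size=(8, 4))
frames = rng.normal(size=(3, 4))
print(quantize_dual_track(frames, ["vocal", "acc", "vocal"], cb_vocal, cb_acc))
```

Because routing is hard (by source label) rather than learned, a downstream LM always knows which vocabulary a token came from, which is one way such a design can keep per-track vocabularies small and LM-friendly.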
Related papers
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design preserves multimodal understanding ability with negligible loss of semantic information while yielding discrete tokens for generation. VQRAE presents competitive performance on several benchmarks of visual understanding, generation, and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio [7.093237513313511]
CoDiCodec is a novel audio autoencoder that overcomes limitations of prior codecs by efficiently encoding global features via summary embeddings. It produces both compressed continuous embeddings at 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model. This work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
arXiv Detail & Related papers (2025-09-11T20:31:18Z) - Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) bakes in redundancy, producing an encoding that is robust when transmitted through noisy channels. We demonstrate that FSQ has vastly superior robustness to bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z) - SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z) - LeVo: High-Quality Song Generation with Multi-Preference Alignment [47.965028296133426]
We introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeVo is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment, and dual-track tokens, which encode vocals and accompaniment separately. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types.
arXiv Detail & Related papers (2025-06-09T07:57:24Z) - FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
arXiv Detail & Related papers (2025-05-02T13:30:19Z) - Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding [0.0]
We introduce Music2Latent2, a novel audio autoencoder. Unlike conventional methods that encode local audio features into ordered sequences, Music2Latent2 compresses audio signals into sets of summary embeddings. To handle arbitrary audio lengths, it employs an autoregressive consistency model trained on two consecutive audio chunks with causal masking.
arXiv Detail & Related papers (2025-01-29T11:34:19Z) - WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z) - Music2Latent: Consistency Autoencoders for Latent Audio Compression [0.0]
We introduce Music2Latent, an audio autoencoder that overcomes limitations by leveraging consistency models.
Music2Latent encodes samples into a compressed continuous latent space in a single end-to-end training process.
We demonstrate that Music2Latent outperforms existing continuous audio autoencoders in sound quality and reconstruction accuracy.
arXiv Detail & Related papers (2024-08-12T21:25:19Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.