An Independence-promoting Loss for Music Generation with Language Models
- URL: http://arxiv.org/abs/2406.02315v2
- Date: Sun, 9 Jun 2024 17:55:51 GMT
- Title: An Independence-promoting Loss for Music Generation with Language Models
- Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez
- Abstract summary: Music generation schemes rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder.
We introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation.
- Score: 64.95095558672996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music generation schemes using language modelling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens; the decoding strategy used for token prediction must therefore be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint-distribution model.
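As a rough illustration of the idea described in the abstract (not the authors' released implementation), the sketch below computes an MMD-based dependence penalty between two codebook streams of the tokenizer. The Gaussian kernel, the bandwidth `sigma`, the HSIC-style empirical estimator, and the tensor shapes are all assumptions made for the example.

```python
# Minimal sketch (illustrative only): an MMD-based proxy for statistical
# dependence between two codebook streams, in the spirit of the
# independence-promoting loss described in the abstract.
import torch


def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian (RBF) kernel matrix between two sets of vectors."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))


def independence_penalty(z1: torch.Tensor, z2: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Empirical MMD between the joint distribution of (z1, z2) and the
    product of its marginals (a biased HSIC-style estimator).

    z1, z2: (batch, dim) embeddings selected by two quantizer stages.
    Returns a scalar that is close to 0 when the two streams are independent.
    """
    n = z1.shape[0]
    k = gaussian_kernel(z1, z1, sigma)            # kernel matrix on stream 1
    l = gaussian_kernel(z2, z2, sigma)            # kernel matrix on stream 2
    h = torch.eye(n, device=z1.device) - 1.0 / n  # centering matrix
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2


# Usage: add the penalty, suitably weighted, to the auto-encoder's
# reconstruction objective during training.
z1 = torch.randn(64, 128)  # e.g. embeddings picked by codebook 1
z2 = torch.randn(64, 128)  # e.g. embeddings picked by codebook 2
loss = independence_penalty(z1, z2)
```

With more than two codebooks, the same penalty could be summed over pairs of streams; how the paper handles multiple codebooks jointly is not specified in the abstract.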
Related papers
- Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior [5.862123282894087]
We build on the Vector Quantized Variational Autoencoder (VQ-VAE), a variational autoencoder that uses discrete embeddings as its latent representation (a minimal quantization sketch follows this list).
We show that GM-VQ improves codebook utilization and reduces information loss without relying on handcrafted heuristics.
arXiv Detail & Related papers (2024-10-14T05:58:11Z) - Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in the continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z) - Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z) - Hierarchical Attention Encoder Decoder [2.4366811507669115]
Autoregressive modeling can generate complex and novel sequences that have many real-world applications.
These models must generate outputs autoregressively, which becomes time-consuming when dealing with long sequences.
We propose a model based on the Hierarchical Recurrent Decoder architecture.
arXiv Detail & Related papers (2023-06-01T18:17:23Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - Sparse Coding with Multi-Layer Decoders using Variance Regularization [19.8572592390623]
We propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder.
Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold.
We show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations.
arXiv Detail & Related papers (2021-12-16T21:46:23Z) - End-to-end Sinkhorn Autoencoder with Noise Generator [10.008055997630304]
We propose a novel end-to-end sinkhorn autoencoder with noise generator for efficient data collection simulation.
Our method outperforms competing approaches on a challenging dataset of simulation data from the Zero Degree Calorimeters of the ALICE experiment at the LHC.
arXiv Detail & Related papers (2020-06-11T18:04:10Z) - Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z) - Learning Autoencoders with Relational Regularization [89.53065887608088]
A new framework is proposed for learning autoencoders of data distributions.
We minimize the discrepancy between the model and target distributions with a relational regularization.
We implement the framework with two scalable algorithms, making it applicable for both probabilistic and deterministic autoencoders.
arXiv Detail & Related papers (2020-02-07T17:27:30Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
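To make the codebook/tokenizer notion that recurs above concrete, here is a minimal sketch of the nearest-neighbour quantization step used by VQ-VAE-style tokenizers. It is not taken from any specific paper listed here; the codebook size, embedding dimension, and the straight-through gradient trick are assumptions.

```python
# Minimal sketch of VQ-VAE-style nearest-neighbour quantization (illustrative only).
import torch


def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map each continuous latent vector to its nearest codebook entry.

    z: (batch, dim) encoder outputs.
    codebook: (num_codes, dim) learnable embedding table.
    Returns discrete token indices and quantized vectors, with a
    straight-through gradient so the encoder still receives gradients.
    """
    distances = torch.cdist(z, codebook)      # (batch, num_codes) pairwise distances
    indices = distances.argmin(dim=-1)        # discrete tokens
    quantized = codebook[indices]             # (batch, dim) selected embeddings
    quantized = z + (quantized - z).detach()  # straight-through estimator
    return indices, quantized


codebook = torch.randn(1024, 128)             # e.g. 1024 codes of dimension 128
z = torch.randn(8, 128)
tokens, z_q = vector_quantize(z, codebook)
```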