MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms
- URL: http://arxiv.org/abs/2509.26007v1
- Date: Tue, 30 Sep 2025 09:38:02 GMT
- Title: MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms
- Authors: Eleonora Ristori, Luca Bindini, Paolo Frasconi
- Abstract summary: We introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms efficiently.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping technique that lowers height and width without discarding information. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity audio generation.
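The abstract describes channel multiplexing (CMX) as a reshaping step that lowers spectrogram height and width without discarding information. The paper's exact formulation is not reproduced here, but a standard space-to-depth fold (folding r×r spatial blocks into extra channels, as in pixel-unshuffle) matches that description; the sketch below is a minimal NumPy illustration under that assumption, with hypothetical function and variable names.

```python
import numpy as np

def channel_multiplex(spec, r=2):
    """Lossless space-to-depth fold: each r x r spatial block becomes
    r*r extra channels, so height and width shrink by r while every
    spectrogram value is preserved (hypothetical CMX-style reshaping)."""
    c, h, w = spec.shape
    assert h % r == 0 and w % r == 0, "dimensions must be divisible by r"
    # Split height and width into (blocks, within-block) axes...
    out = spec.reshape(c, h // r, r, w // r, r)
    # ...then move the within-block axes next to the channel axis.
    out = out.transpose(0, 2, 4, 1, 3)
    return out.reshape(c * r * r, h // r, w // r)

# Example: a 2-channel spectrogram (e.g. real/imaginary STFT parts).
spec = np.random.randn(2, 128, 256)
packed = channel_multiplex(spec, r=2)
print(packed.shape)  # (8, 64, 128): 4x more channels, half the height and width
```

Because the fold is a pure permutation of values, the total information content is unchanged, which is consistent with the abstract's claim that CMX lowers spatial resolution "without discarding information."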
Related papers
- WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing [5.65492058135409]
WaveMAE is a masked autoencoding framework tailored for multispectral satellite imagery. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW-S2). WaveMAE achieves consistent improvements over prior state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-26T14:45:30Z) - MARS-Sep: Multimodal-Aligned Reinforced Sound Separation [72.85468563236005]
MARS-Sep is a reinforcement learning framework for sound separation. It learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate. Experiments on multiple benchmarks demonstrate consistent gains in text-, audio-, and image-queried separation.
arXiv Detail & Related papers (2025-10-12T09:05:28Z) - Learning Multi-scale Spatial-frequency Features for Image Denoising [58.883244886588336]
We propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. In order to realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit.
arXiv Detail & Related papers (2025-06-19T13:28:09Z) - Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM [12.005825075325234]
We propose a unified multi-ASR prompt-driven framework using post-processing by either textual or speech-based large language models. We show significant improvements in transcription accuracy compared to traditional methods.
arXiv Detail & Related papers (2025-06-05T12:35:53Z) - Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition [83.40450475728792]
We present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions. Experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size.
arXiv Detail & Related papers (2025-05-25T12:23:10Z) - SinBasis Networks: Matrix-Equivalent Feature Extraction for Wave-Like Optical Spectrograms [8.37266944852829]
We propose a unified, matrix-equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs. Embedding these transforms into CNN, ViT, and Capsule architectures yields Sin-Basis Networks with heightened sensitivity to periodic motifs.
arXiv Detail & Related papers (2025-05-06T16:16:42Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K, and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis [3.9940425551415597]
We propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT).
This paradigm introduces a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder.
The experimental results demonstrate that speech synthesised using the model with the Mel spectrogram enhancement paradigm exhibits higher MOS, with improvements of 0.14 and 0.09 over the respective baseline models.
arXiv Detail & Related papers (2024-06-18T00:34:44Z) - Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation [21.896817015593122]
MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
arXiv Detail & Related papers (2023-09-27T18:23:03Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.