Mono-to-stereo through parametric stereo generation
- URL: http://arxiv.org/abs/2306.14647v1
- Date: Mon, 26 Jun 2023 12:33:29 GMT
- Title: Mono-to-stereo through parametric stereo generation
- Authors: Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle
- Abstract summary: We propose to convert mono to stereo by means of predicting parametric stereo parameters.
In combination with PS, we also propose to model the task with generative approaches.
We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline.
- Score: 21.502860265488216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating a stereophonic presentation from a monophonic audio signal is a
challenging open task, especially if the goal is to obtain a realistic spatial
imaging with a specific panning of sound elements. In this work, we propose to
convert mono to stereo by means of predicting parametric stereo (PS) parameters
using both nearest neighbor and deep network approaches. In combination with
PS, we also propose to model the task with generative approaches, allowing us to
synthesize multiple and equally-plausible stereo renditions from the same mono
signal. To achieve this, we consider both autoregressive and masked token
modelling approaches. We provide evidence that the proposed PS-based models
outperform a competitive classical decorrelation baseline and that, within a PS
prediction framework, modern generative models outshine equivalent
non-generative counterparts. Overall, our work positions both PS and generative
modelling as strong and appealing methodologies for mono-to-stereo upmixing. A
discussion of the limitations of these approaches is also provided.
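
To make the parametric-stereo idea concrete, here is a minimal, illustrative Python sketch of PS synthesis: reconstructing a stereo pair from a mono signal plus band-wise inter-channel intensity difference (IID) and inter-channel correlation (ICC) parameters. The band layout, the random-phase decorrelator, and the constant parameter values below are hypothetical stand-ins; the paper's actual PS analysis/synthesis follows the standard formulation used in parametric-stereo codecs, and its parameters would come from the proposed nearest-neighbour or deep/generative predictors rather than being hand-set.

```python
import numpy as np
from scipy.signal import stft, istft

def ps_synthesize(mono, fs, band_edges_hz, iid_db, icc, nperseg=1024, seed=0):
    """Upmix a mono signal to stereo from per-band IID/ICC parameters."""
    f, _, M = stft(mono, fs=fs, nperseg=nperseg)

    # Cheap decorrelator: a fixed random all-pass (unit-magnitude phase
    # rotation per frequency bin), standing in for the comb/all-pass
    # decorrelators used in classical upmixers.
    rng = np.random.default_rng(seed)
    D = M * np.exp(1j * rng.uniform(0, 2 * np.pi, size=f.shape))[:, None]

    L, R = np.zeros_like(M), np.zeros_like(M)
    for k, (lo, hi) in enumerate(zip(band_edges_hz[:-1], band_edges_hz[1:])):
        bins = (f >= lo) & (f < hi)
        # Mix mono and decorrelated components so that corr(L, R) ~= icc[k].
        a = np.sqrt((1 + icc[k]) / 2)  # weight of the common (mono) part
        b = np.sqrt((1 - icc[k]) / 2)  # weight of the decorrelated part
        # Split energy across channels according to IID (power-preserving).
        c = 10 ** (iid_db[k] / 10)
        gl, gr = np.sqrt(2 * c / (1 + c)), np.sqrt(2 / (1 + c))
        L[bins] = gl * (a * M[bins] + b * D[bins])
        R[bins] = gr * (a * M[bins] - b * D[bins])

    _, left = istft(L, fs=fs, nperseg=nperseg)
    _, right = istft(R, fs=fs, nperseg=nperseg)
    return np.stack([left, right])

# Hypothetical usage with constant per-band parameters; the paper's models
# would instead predict iid_db/icc per time frame from the mono input.
fs = 44100
mono = np.random.randn(fs)  # 1 s placeholder signal
bands = [0, 500, 2000, 8000, np.inf]
stereo = ps_synthesize(mono, fs, bands, iid_db=[0.0, 3.0, -2.0, 1.0],
                       icc=[0.9, 0.6, 0.4, 0.3])
```

With the mono and decorrelated components roughly uncorrelated and of equal power, the a/b mix sets the inter-channel correlation to about icc[k], while gl/gr split the energy according to iid_db[k]; this is the sense in which a handful of PS parameters per band suffice to steer the stereo image.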
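As one way to picture the masked-token route over quantized PS parameters, below is a generic MaskGIT-style iterative decoding loop. This is not the paper's exact architecture or schedule: `model`, the mask token id, and the cosine schedule are all hypothetical placeholders.

```python
import math
import torch

MASK_ID = 1024  # hypothetical id of the [MASK] token over quantized PS codes

@torch.no_grad()
def masked_token_decode(model, seq_len, steps=8):
    """Iteratively unmask a token sequence, MaskGIT-style."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(tokens)                     # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens == MASK_ID
        # Only still-masked positions compete for being revealed.
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        # Cosine schedule: the masked fraction shrinks to zero by the last step.
        n_keep_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        n_reveal = int(masked.sum()) - n_keep_masked
        if n_reveal <= 0:
            continue
        idx = conf.topk(n_reveal, dim=-1).indices  # most confident positions
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```

An autoregressive variant would instead emit one token at a time, left to right. Note the sketch takes the argmax for simplicity; sampling from the softmax instead is what would yield multiple, equally plausible stereo renditions from the same mono input, as the abstract describes.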
Related papers
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources.
By leveraging spatial guidance, our unified model achieves the objective of generating immersive and controllable spatial audio from text and image.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling [18.02254687807291]
Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models due to the inherent data scarcity of the stereo matching task.
We propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, which enhances locality inductive bias by leveraging Masked Image Modeling (MIM) when training a Transformer-based stereo model.
arXiv Detail & Related papers (2024-09-04T16:17:45Z)
- StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models [2.9260206957981167]
We introduce StereoDiffusion, a method that is training-free, remarkably straightforward to use, and integrates seamlessly into the original Stable Diffusion model.
Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs.
Our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
arXiv Detail & Related papers (2024-03-08T00:30:25Z)
- Resource-constrained stereo singing voice cancellation [1.0962868591006976]
We study the problem of stereo singing voice cancellation.
Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial.
arXiv Detail & Related papers (2024-01-22T16:05:30Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Single-View View Synthesis with Self-Rectified Pseudo-Stereo [49.946151180828465]
We leverage the reliable and explicit stereo prior to generate a pseudo-stereo viewpoint.
We propose a self-rectified stereo synthesis to amend erroneous regions in an identify-rectify manner.
Our method outperforms state-of-the-art single-view view synthesis methods and stereo synthesis methods.
arXiv Detail & Related papers (2023-04-19T09:36:13Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)