Mono-to-stereo through parametric stereo generation
- URL: http://arxiv.org/abs/2306.14647v1
- Date: Mon, 26 Jun 2023 12:33:29 GMT
- Title: Mono-to-stereo through parametric stereo generation
- Authors: Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle
- Abstract summary: We propose to convert mono to stereo by means of predicting parametric stereo parameters.
In combination with PS, we also propose to model the task with generative approaches.
We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline.
- Score: 21.502860265488216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating a stereophonic presentation from a monophonic audio signal is a
challenging open task, especially if the goal is to obtain a realistic spatial
imaging with a specific panning of sound elements. In this work, we propose to
convert mono to stereo by means of predicting parametric stereo (PS) parameters
using both nearest neighbor and deep network approaches. In combination with
PS, we also propose to model the task with generative approaches, allowing us
synthesize multiple and equally-plausible stereo renditions from the same mono
signal. To achieve this, we consider both autoregressive and masked token
modelling approaches. We provide evidence that the proposed PS-based models
outperform a competitive classical decorrelation baseline and that, within a PS
prediction framework, modern generative models outshine equivalent
non-generative counterparts. Overall, our work positions both PS and generative
modelling as strong and appealing methodologies for mono-to-stereo upmixing. A
discussion of the limitations of these approaches is also provided.
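To make the PS pipeline concrete, below is a minimal decoding sketch in Python/NumPy: given band-wise inter-channel level differences (ILD) and inter-channel correlations (ICC), assumed here to have already been predicted from the mono input, it upmixes a mono waveform to stereo. The STFT settings, band edges, mixing rule, and the delay-based decorrelator are illustrative choices, not the paper's implementation.

```python
# Minimal parametric-stereo (PS) decoding sketch. Assumes per-frame, per-band
# ILD (dB) and ICC (in [-1, 1]) parameters have been predicted elsewhere.
import numpy as np

def stft(x, n_fft=1024, hop=256):
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    return np.stack([np.fft.rfft(win * x[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])

def istft(X, n_fft=1024, hop=256):
    win = np.hanning(n_fft)
    out = np.zeros((len(X) - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(X):
        out[i * hop:i * hop + n_fft] += win * np.fft.irfft(frame, n_fft)
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def ps_decode(mono, ild_db, icc, band_edges, n_fft=1024, hop=256):
    """Upmix mono -> stereo from band-wise PS parameters.

    ild_db:     (n_frames, n_bands) level difference L vs. R in dB
    icc:        (n_frames, n_bands) target correlation between L and R
    band_edges: (n_bands + 1,) FFT-bin boundaries covering 0..n_fft//2 + 1
    """
    M = stft(mono, n_fft, hop)                        # (n_frames, n_bins)
    # Decorrelated copy of the mono signal. A one-frame delay is a crude
    # stand-in for the all-pass decorrelators used in real PS decoders.
    D = np.vstack([np.zeros((1, M.shape[1]), complex), M[:-1]])
    L, R = np.zeros_like(M), np.zeros_like(M)
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        # Rotation angle that yields the target correlation: cos(2a) = ICC.
        a = 0.5 * np.arccos(np.clip(icc[:, b:b + 1], -1.0, 1.0))
        g = 10.0 ** (ild_db[:, b:b + 1] / 20.0)       # L/R amplitude ratio
        cl = np.sqrt(2.0 * g ** 2 / (1.0 + g ** 2))   # power-preserving gains
        cr = np.sqrt(2.0 / (1.0 + g ** 2))
        L[:, lo:hi] = cl * (np.cos(a) * M[:, lo:hi] + np.sin(a) * D[:, lo:hi])
        R[:, lo:hi] = cr * (np.cos(a) * M[:, lo:hi] - np.sin(a) * D[:, lo:hi])
    return np.stack([istft(L, n_fft, hop), istft(R, n_fft, hop)])
```

In the framework the abstract describes, the interesting part is predicting ild_db and icc; a decoder along these lines is the fixed signal-processing half of the system.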
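For the generative variant, the abstract mentions both autoregressive and masked token modelling. A masked-token sampling loop over quantized PS-parameter tokens could look like the hypothetical sketch below, in the style of MaskGIT-like iterative decoding; the model signature, mask_id, codebook, and cosine unmasking schedule are all assumptions, not the paper's exact setup.

```python
# Hypothetical masked-token sampling over quantized PS-parameter tokens.
# `model(tokens, mono_features)` is assumed to return logits of shape
# (batch, seq_len, vocab_size) for every position.
import math
import torch

@torch.no_grad()
def sample_ps_tokens(model, mono_features, seq_len, mask_id, n_steps=8):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(1, n_steps + 1):
        logits = model(tokens.unsqueeze(0), mono_features)[0]  # (seq_len, V)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs, 1).squeeze(-1)      # (seq_len,)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        conf[tokens != mask_id] = float("inf")  # already-fixed tokens persist
        # Cosine schedule: the fraction of unmasked positions grows to 1.
        frac = 1.0 if step == n_steps else \
            1.0 - math.cos(math.pi / 2 * step / n_steps)
        keep = conf.topk(max(int(seq_len * frac), 1)).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[keep] = torch.where(tokens[keep] == mask_id,
                                       sampled[keep], tokens[keep])
        tokens = new_tokens
    return tokens  # dequantize to ILD/ICC, then run a PS decoder as above
```

Sampling the tokens rather than taking an argmax is what allows multiple, equally-plausible stereo renditions of the same mono input.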
Related papers
- Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis [4.062046658662013]
We propose a novel autoregressive modeling approach for speech synthesis.
We combine a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution.
Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations.
arXiv Detail & Related papers (2025-02-03T05:53:59Z)
- FoundationStereo: Zero-Shot Stereo Matching [50.79202911274819]
FoundationStereo is a foundation model for stereo depth estimation.
We first construct a large-scale (1M stereo pairs) synthetic training dataset.
We then design a number of network architecture components to enhance scalability.
arXiv Detail & Related papers (2025-01-17T01:01:44Z)
- Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail [37.90622613373521]
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs).
We show that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions.
arXiv Detail & Related papers (2024-12-05T18:59:58Z)
- MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling [18.02254687807291]
Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based stereo models due to the inherent data scarcity of the stereo matching task.
We propose a Masked Image Modeling Distilled stereo matching model, termed MaDis-Stereo, that enhances the locality inductive bias by leveraging Masked Image Modeling (MIM) when training Transformer-based stereo models.
arXiv Detail & Related papers (2024-09-04T16:17:45Z)
- StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models [2.9260206957981167]
We introduce StereoDiffusion, a training-free method that is remarkably straightforward to use and integrates seamlessly into the original Stable Diffusion model.
Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs.
Our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
arXiv Detail & Related papers (2024-03-08T00:30:25Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Single-View View Synthesis with Self-Rectified Pseudo-Stereo [49.946151180828465]
We leverage a reliable and explicit stereo prior to generate a pseudo-stereo viewpoint.
We propose a self-rectified stereo synthesis to amend erroneous regions in an identify-rectify manner.
Our method outperforms state-of-the-art single-view view synthesis methods and stereo synthesis methods.
arXiv Detail & Related papers (2023-04-19T09:36:13Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.