Hybrid Spectrogram and Waveform Source Separation
- URL: http://arxiv.org/abs/2111.03600v1
- Date: Fri, 5 Nov 2021 16:37:45 GMT
- Title: Hybrid Spectrogram and Waveform Source Separation
- Authors: Alexandre Défossez
- Abstract summary: We show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source.
The proposed hybrid version of the Demucs architecture won the Music Demixing Challenge 2021 organized by Sony.
- Score: 91.3755431537592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source separation models either work on the spectrogram or waveform domain.
In this work, we show how to perform end-to-end hybrid source separation,
letting the model decide which domain is best suited for each source, and even
combining both. The proposed hybrid version of the Demucs architecture won the
Music Demixing Challenge 2021 organized by Sony. This architecture also comes
with additional improvements, such as compressed residual branches, local
attention or singular value regularization. Overall, a 1.4 dB improvement of
the Signal-To-Distortion Ratio (SDR) was observed across all sources as measured
on the MusDB HQ dataset, an improvement confirmed by human subjective evaluation,
with an overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid
Demucs), and absence of contamination at 3.04 (against 2.37 for the non-hybrid
Demucs and 2.44 for the second ranking model submitted at the competition).
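As a rough illustration of the SDR metric quoted in the abstract, the sketch below computes a simple global signal-to-distortion ratio in dB between a reference source and its estimate. The function name `sdr_db`, the `eps` stabilizer, and the toy signals are assumptions made for this example; the evaluation protocol behind the reported MusDB HQ numbers may differ (for instance, by aggregating over frames or tracks).

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float], eps: float = 1e-10) -> float:
    """Global signal-to-distortion ratio in dB (illustrative helper, not the paper's code)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the true source signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual between the true source and the estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: a short reference signal and a slightly perturbed estimate.
reference = [0.0, 1.0, 0.0, -1.0]
estimate = [0.0, 0.9, 0.1, -1.0]
score = sdr_db(reference, estimate)
print(f"SDR: {score:.2f} dB")
```

Under this definition, a 1.4 dB gain corresponds to the residual error energy shrinking by a factor of about 10^(1.4/10) ≈ 1.38 for the same reference signal.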
Related papers
- HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network [0.0]
3D human pose is a promising research area that leverages estimated and ground-truth 2D human pose data for training.
Existing approaches aim to enhance the performance of estimated 2D poses, but struggle when applied to ground-truth 2D pose data.
We propose a novel Hyper-GCN and Shuffle Mamba block, which processes input data through two parallel streams.
arXiv Detail & Related papers (2025-04-09T07:28:19Z)
- CoRe^2: Collect, Reflect and Refine to Generate Better and Faster [11.230943283470522]
We introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine.
CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content.
It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench.
arXiv Detail & Related papers (2025-03-12T15:15:25Z)
- Robust Fine-tuning of Zero-shot Models via Variance Reduction [56.360865951192324]
When fine-tuning zero-shot models, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) settings.
We propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs.
arXiv Detail & Related papers (2024-11-11T13:13:39Z)
- Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
We introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players.
This framework of Evolving Alignment via Asymmetric Self-Play (eva) results in a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment.
arXiv Detail & Related papers (2024-10-31T08:15:32Z)
- SZU-AFS Antispoofing System for the ASVspoof 5 Challenge [3.713577625357432]
The SZU-AFS anti-spoofing system was designed for Track 1 of the ASVspoof 5 Challenge under open conditions.
The final fusion system achieves a minDCF of 0.115 and an EER of 4.04% on the evaluation set.
arXiv Detail & Related papers (2024-08-19T12:12:29Z)
- PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation [47.53810786827547]
Single image depth estimation is a foundational task in computer vision and generative modeling.
We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art.
Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details.
arXiv Detail & Related papers (2023-12-04T19:03:12Z)
- Efficient Integrators for Diffusion Generative Models [22.01769257075573]
Diffusion models suffer from slow sample generation at inference time.
We propose two complementary frameworks for accelerating sample generation in pre-trained models.
We present a hybrid method that leads to the best-reported performance for diffusion models in augmented spaces.
arXiv Detail & Related papers (2023-10-11T21:04:42Z)
- Occluded Human Mesh Recovery [23.63235079216075]
We present Occluded Human Mesh Recovery (OCHMR) - a novel top-down mesh recovery approach that incorporates image spatial context.
OCHMR achieves superior performance on challenging multi-person benchmarks like 3DPW, CrowdPose and OCHuman.
arXiv Detail & Related papers (2022-03-24T21:39:20Z)
- MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation.
In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
- Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection [85.53263670166304]
A one-stage detector basically formulates object detection as dense classification and localization.
A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization.
This paper delves into the representations of the above three fundamental elements: quality estimation, classification and localization.
arXiv Detail & Related papers (2020-06-08T07:24:33Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.