SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance
- URL: http://arxiv.org/abs/2203.13535v1
- Date: Fri, 25 Mar 2022 09:42:11 GMT
- Title: SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance
- Authors: Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Ziwei Liu, and Di Hu
- Abstract summary: This work focuses on the separation of unknown musical instruments.
We propose the Separation-with-Consistency (SeCo) framework, which can accomplish the separation on unknown categories.
Our framework exhibits strong adaptation ability on the novel musical categories and outperforms the baseline methods by a significant margin.
- Score: 88.0355290619761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed the success of deep learning on the visual sound
separation task. However, existing works follow similar settings where the
training and testing datasets share the same musical instrument categories,
which to some extent limits the versatility of this task. In this work, we
focus on a more general and challenging scenario, namely the separation of
unknown musical instruments, where the categories in training and testing
phases have no overlap with each other. To tackle this new setting, we propose
the Separation-with-Consistency (SeCo) framework, which can accomplish the
separation on unknown categories by exploiting the consistency constraints.
Furthermore, to capture richer characteristics of the novel melodies, we devise
an online matching strategy, which can bring stable enhancements with no cost
of extra parameters. Experiments demonstrate that our SeCo framework exhibits
strong adaptation ability on the novel musical categories and outperforms the
baseline methods by a significant margin.
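The abstract does not specify how the consistency constraints are formulated. As a rough, hypothetical illustration of the general idea, the sketch below pairs a standard mask-based spectrogram reconstruction loss with a consistency term that ties together the masks predicted from two views of the same visual input. All names, shapes, and the weighting `lam` are assumptions for illustration, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D = 8, 16, 4            # freq bins, time frames, visual feature dim

def predict_mask(mix_spec, visual_feat, W):
    """Toy mask head: maps a visual feature to per-frequency gains and
    squashes to (0, 1). Stands in for the (unspecified) separation network."""
    logits = (W @ visual_feat)[:, None] * mix_spec
    return 1.0 / (1.0 + np.exp(-logits))

def seco_style_loss(mix_spec, target_spec, feat_a, feat_b, W, lam=0.5):
    """Reconstruction loss plus a consistency term that encourages masks
    predicted from two views (e.g. augmentations) of the same source to
    agree. lam is an illustrative weight; the paper's loss is not given."""
    mask_a = predict_mask(mix_spec, feat_a, W)
    mask_b = predict_mask(mix_spec, feat_b, W)
    recon = np.mean((mask_a * mix_spec - target_spec) ** 2)
    consist = np.mean((mask_a - mask_b) ** 2)
    return recon + lam * consist

mix = rng.random((F, T))
target = 0.5 * mix                          # toy "ground-truth" source spectrogram
feat = rng.normal(size=D)                   # visual feature for one instrument
feat_aug = feat + 0.01 * rng.normal(size=D) # slightly perturbed second view
W = rng.normal(size=(F, D))

loss = seco_style_loss(mix, target, feat, feat_aug, W)
```

Because the consistency term depends only on the model's own predictions, it requires no labels for the novel categories, which is what makes this style of constraint plausible for unknown-instrument separation.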
Related papers
- Continual Audio-Visual Sound Separation [35.06195539944879]
We introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes.
We propose a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks.
Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines.
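The summary does not spell out the CrossSDC formulation. As a loose sketch of distilling cross-modal similarity across incremental tasks, one could penalize drift between the audio-visual similarity matrix of the current model and that of a frozen copy from the previous task; the function names and the MSE form below are assumptions.

```python
import numpy as np

def cross_modal_sim(audio, visual):
    """Cosine similarity matrix between audio and visual embeddings."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    return a @ v.T

def crosssdc_style_loss(audio, visual, audio_old, visual_old):
    """Toy distillation term: penalize changes in the cross-modal similarity
    structure relative to a frozen previous-task model, discouraging
    catastrophic forgetting. The actual CrossSDC is not given in the summary."""
    return np.mean((cross_modal_sim(audio, visual)
                    - cross_modal_sim(audio_old, visual_old)) ** 2)

rng = np.random.default_rng(1)
a_new = rng.normal(size=(3, 4))
v_new = rng.normal(size=(3, 4))
loss_same = crosssdc_style_loss(a_new, v_new, a_new, v_new)
print(loss_same)   # → 0.0 (identical embeddings, no drift to penalize)
```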
arXiv Detail & Related papers (2024-11-05T07:09:14Z) - Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement [10.714947060480426]
We propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model.
Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines.
arXiv Detail & Related papers (2024-08-27T16:18:51Z) - Strike a Balance in Continual Panoptic Segmentation [60.26892488010291]
We introduce past-class backtrace distillation to balance the stability of existing knowledge with the adaptability to new information.
We also introduce a class-proportional memory strategy, which aligns the class distribution in the replay sample set with that of the historical training data.
We present a new method named Continual Panoptic Balanced (BalConpas).
arXiv Detail & Related papers (2024-07-23T09:58:20Z) - Structured Multi-Track Accompaniment Arrangement via Style Prior Modelling [9.489311894706765]
In this paper, we introduce a novel system that leverages prior modelling over disentangled style factors to address these challenges.
Our key design is the use of vector quantization and a unique multi-stream Transformer to model the long-term flow of the orchestration style.
We show that our system achieves superior coherence, structure, and overall arrangement quality compared to existing baselines.
arXiv Detail & Related papers (2023-10-25T03:30:37Z) - Unsupervised Meta-Learning via Few-shot Pseudo-supervised Contrastive Learning [72.3506897990639]
We propose a simple yet effective unsupervised meta-learning framework, coined Pseudo-supervised Contrast (PsCo) for few-shot classification.
PsCo outperforms existing unsupervised meta-learning methods under various in-domain and cross-domain few-shot classification benchmarks.
arXiv Detail & Related papers (2023-03-02T06:10:13Z) - Dynamic Supervisor for Cross-dataset Object Detection [52.95818230087297]
Cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning.
We propose a dynamic supervisor framework that updates the annotations multiple times through multiple-updated submodels trained using hard and soft labels.
In the final generated annotations, both recall and precision improve significantly through the integration of hard-label training with soft-label training.
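The summary describes integrating hard-label and soft-label signals when updating annotations, without giving the exact rule. The toy sketch below is one plausible reading: hard labels keep boxes that any submodel is confident about (precision), while averaged soft scores recover borderline boxes (recall). The thresholds and the voting rule are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def update_annotations(scores, hard_thresh=0.7, soft_thresh=0.5):
    """Toy annotation update: 'scores' holds each submodel's confidence for
    every candidate box, shape (n_models, n_boxes). A box is kept if any
    submodel passes the hard threshold, or if the mean (soft) confidence
    passes the soft threshold. Thresholds are hypothetical."""
    soft = scores.mean(axis=0)                     # soft label: mean confidence
    hard = (scores > hard_thresh).any(axis=0)      # hard label: any confident vote
    keep = hard | (soft > soft_thresh)             # integrate both signals
    return keep, soft

scores = np.array([[0.9, 0.40, 0.2],
                   [0.6, 0.65, 0.1]])
keep, soft = update_annotations(scores)
print(keep.tolist())   # → [True, True, False]
```

Box 0 survives on a hard vote (0.9 > 0.7), box 1 only on its soft average (0.525 > 0.5), and box 2 on neither, which is the precision/recall trade-off the abstract alludes to.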
arXiv Detail & Related papers (2022-04-01T03:18:46Z) - Generating Lead Sheets with Affect: A Novel Conditional seq2seq Framework [3.029434408969759]
We present a novel approach for calculating the positivity or negativity of a chord progression within a lead sheet.
We treat the task as a Neural Machine Translation (NMT) problem, including high-level conditions in the encoder of the sequence-to-sequence architecture.
The proposed strategy is able to generate lead sheets in a controllable manner, resulting in distributions of musical attributes similar to those of the training dataset.
arXiv Detail & Related papers (2021-04-27T09:04:21Z) - Structure-Aware Audio-to-Score Alignment using Progressively Dilated Convolutional Neural Networks [8.669338893753885]
The identification of structural differences between a music performance and the score is a challenging yet integral step of audio-to-score alignment.
We present a novel method to detect such differences using progressively dilated convolutional neural networks.
arXiv Detail & Related papers (2021-01-31T05:14:58Z) - Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
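The Music Gesture summary describes a two-stage pipeline (a graph network over keypoints, then audio-visual fusion) without implementation detail. As a minimal, hypothetical stand-in, the sketch below flattens body keypoints into a motion vector, projects both modalities into a shared embedding space, and scores their association by cosine similarity; the linear projections replace the unspecified graph network and fusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_joints, audio_dim, emb_dim = 15, 32, 8

def fuse_gesture_audio(keypoints, audio_feat, Wk, Wa):
    """Toy audio-visual association score: flatten (n_joints, 2) keypoint
    coordinates into a motion vector, linearly project each modality into a
    shared space, and return the cosine similarity of the embeddings. The
    projections are illustrative placeholders for the paper's networks."""
    v = Wk @ keypoints.reshape(-1)   # visual (body-motion) embedding
    a = Wa @ audio_feat              # audio embedding
    return float(v @ a / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8))

kp = rng.random((n_joints, 2))               # 2-D keypoint coordinates
audio = rng.normal(size=audio_dim)
Wk = rng.normal(size=(emb_dim, n_joints * 2))
Wa = rng.normal(size=(emb_dim, audio_dim))
score = fuse_gesture_audio(kp, audio, Wk, Wa)
```

A separation system could use such a score to associate each detected player's movements with one of the candidate audio sources.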
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.