Spatial and spectral deep attention fusion for multi-channel speech
separation using deep embedding features
- URL: http://arxiv.org/abs/2002.01626v1
- Date: Wed, 5 Feb 2020 03:49:39 GMT
- Title: Spatial and spectral deep attention fusion for multi-channel speech
separation using deep embedding features
- Authors: Cunhang Fan, Bin Liu, Jianhua Tao, Jiangyan Yi, and Zhengqi Wen
- Abstract summary: Multi-channel deep clustering (MDC) has achieved good performance for speech separation.
We propose a deep attention fusion method that dynamically controls the weights of the spectral and spatial features and combines them deeply.
Experimental results show that the proposed method outperforms the MDC baseline and even the oracle ideal binary mask (IBM).
- Score: 60.20150317299749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-channel deep clustering (MDC) has achieved good performance for
speech separation. However, MDC uses the spatial features only as additional
information, so it is difficult to learn the mutual relationship between
spatial and spectral features. Besides, the training objective of MDC is
defined on embedding vectors rather than on the real separated sources, which may
damage the separation performance. In this work, we propose a deep attention
fusion method to dynamically control the weights of the spectral and spatial
features and combine them deeply. In addition, to solve the training-objective
problem of MDC, the real separated sources are used as the training objectives.
Specifically, we apply the deep clustering network to extract deep embedding
features. Instead of using unsupervised K-means clustering to estimate
binary masks, another supervised network is utilized to learn soft masks from
these deep embedding features. Our experiments are conducted on a spatialized
reverberant version of the WSJ0-2mix dataset. Experimental results show that the
proposed method outperforms the MDC baseline and even the oracle ideal
binary mask (IBM).
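The fusion idea described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's actual architecture: the function name `attention_fusion`, the projection matrices `W_q`/`W_k`, and the scaled dot-product scoring are hypothetical stand-ins, and in the paper the attention weights would be learned jointly with the rest of the separation network rather than computed from random projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(spectral, spatial, W_q, W_k):
    """Dynamically weight spectral vs. spatial features per frame.

    spectral, spatial: (T, F) feature matrices, e.g. log-magnitude
        spectra and inter-channel phase differences (IPDs).
    W_q, W_k: (F, d) projection matrices (hypothetical learned params).
    Returns the fused (T, F) features and the (T, 2) attention weights.
    """
    q = spectral @ W_q                       # query from the spectral stream, (T, d)
    k_spec = spectral @ W_k                  # key for the spectral stream, (T, d)
    k_spat = spatial @ W_k                   # key for the spatial stream, (T, d)
    # One scalar score per stream per frame (scaled dot product).
    scores = np.stack([(q * k_spec).sum(-1),
                       (q * k_spat).sum(-1)], axis=-1) / np.sqrt(W_q.shape[1])
    weights = softmax(scores, axis=-1)       # (T, 2), rows sum to 1
    # Convex per-frame combination of the two feature streams.
    fused = weights[:, :1] * spectral + weights[:, 1:] * spatial
    return fused, weights

T, F, d = 100, 129, 32
spectral = rng.standard_normal((T, F))
spatial = rng.standard_normal((T, F))
W_q = rng.standard_normal((F, d)) * 0.1
W_k = rng.standard_normal((F, d)) * 0.1
fused, w = attention_fusion(spectral, spatial, W_q, W_k)
```

The fused features would then feed the deep clustering network, whose embeddings a supervised mask-estimation network maps to soft masks instead of running K-means.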
Related papers
- DAAL: Density-Aware Adaptive Line Margin Loss for Multi-Modal Deep Metric Learning [1.9472493183927981]
We propose a novel loss function called Density-Aware Adaptive Margin Loss (DAAL)
DAAL preserves the density distribution of embeddings while encouraging the formation of adaptive sub-clusters within each class.
Experiments on benchmark fine-grained datasets demonstrate the superior performance of DAAL.
arXiv Detail & Related papers (2024-10-07T19:04:24Z)
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM)
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data.
This has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems.
We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z)
- SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification [35.52272615695294]
We propose a spatial-spectral masked auto-encoder (SS-MAE) for HSI and LiDAR/SAR data joint classification.
Our SS-MAE fully exploits the spatial and spectral representations of the input data.
To complement local features in the training stage, we add two lightweight CNNs for feature extraction.
arXiv Detail & Related papers (2023-11-08T03:54:44Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
- Neural Manifold Clustering and Embedding [13.08270828061924]
Non-linear subspace clustering or manifold clustering aims to cluster data points based on manifold structures and learn to parameterize each manifold as a linear subspace in a feature space.
Deep neural networks have the potential to achieve this goal under highly non-linear settings given their large capacity and flexibility.
We argue that achieving manifold clustering with neural networks requires two essential ingredients: a domain-specific constraint that ensures the identification of the manifold, and a learning algorithm for embedding each manifold to a linear subspace in the feature space.
arXiv Detail & Related papers (2022-01-24T23:13:37Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space.
In this study, we desire to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.