A Multimodal Canonical-Correlated Graph Neural Network for
Energy-Efficient Speech Enhancement
- URL: http://arxiv.org/abs/2202.04528v1
- Date: Wed, 9 Feb 2022 15:47:07 GMT
- Title: A Multimodal Canonical-Correlated Graph Neural Network for
Energy-Efficient Speech Enhancement
- Authors: Leandro Aparecido Passos, João Paulo Papa, Amir Hussain, Ahsan Adeel
- Abstract summary: This paper proposes a novel multimodal self-supervised architecture for energy-efficient AV speech enhancement.
It integrates graph neural networks with canonical correlation analysis (CCA-GNN).
Experiments conducted on the benchmark ChiME3 dataset show that our proposed prior-frame-based AV CCA-GNN achieves better feature learning in the temporal context.
- Score: 4.395837214164745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a novel multimodal self-supervised architecture for
energy-efficient AV speech enhancement by integrating graph neural networks
with canonical correlation analysis (CCA-GNN). This builds on a
state-of-the-art CCA-GNN that aims to learn representative embeddings by
maximizing the correlation between pairs of augmented views of the same input
while decorrelating disconnected features. The key idea of the conventional
CCA-GNN is to discard augmentation-variant information and preserve
augmentation-invariant information while preventing the capture of redundant
information. Our proposed AV CCA-GNN model is designed to handle the
challenging multimodal representation learning context. Specifically, our model
improves contextual AV speech processing by maximizing the canonical correlation
between augmented views of the same channel, as well as between the audio and
visual embeddings. In addition, we propose a positional encoding of the nodes
that uses prior-frame sequence distance, rather than feature-space distance,
when computing each node's nearest neighbors. This
serves to introduce temporal information in the embeddings through the
neighborhood's connectivity. Experiments conducted with the benchmark ChiME3
dataset show that our proposed prior-frame-based AV CCA-GNN achieves better
feature learning in the temporal context, leading to more energy-efficient
speech reconstruction compared to state-of-the-art CCA-GNN and multi-layer
perceptron models. The results demonstrate the potential of our proposed
approach for exploitation in future assistive technology and energy-efficient
multimodal devices.
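The two mechanisms described in the abstract are concrete enough to sketch. Below is a minimal PyTorch sketch, assuming the objective follows the invariance-plus-decorrelation form of the CCA-SSG line of work this paper builds on, and that "prior-frame" neighbors are simply the k preceding frames; all function names and hyperparameters (`av_cca_loss`, `prior_frame_edges`, `k`, `lam`) are illustrative assumptions, not the authors' released code.

```python
import torch

def standardize(z):
    # Zero-mean, unit-variance columns, scaled so that Z^T Z
    # approximates a correlation matrix.
    z = z - z.mean(dim=0)
    return z / (z.std(dim=0) + 1e-8) / (z.size(0) ** 0.5)

def cca_ssg_loss(z1, z2, lam=1e-3):
    # Invariance term pulls paired embeddings together; decorrelation
    # term pushes each view's feature covariance toward the identity,
    # discarding redundant information.
    z1, z2 = standardize(z1), standardize(z2)
    eye = torch.eye(z1.size(1), device=z1.device)
    invariance = (z1 - z2).pow(2).sum()
    decorrelation = ((z1.T @ z1) - eye).pow(2).sum() \
                  + ((z2.T @ z2) - eye).pow(2).sum()
    return invariance + lam * decorrelation

def av_cca_loss(za1, za2, zv1, zv2, lam=1e-3):
    # Within-modality terms (two augmented views of the same channel)
    # plus a cross-modal term tying audio embeddings to visual ones.
    return (cca_ssg_loss(za1, za2, lam)
            + cca_ssg_loss(zv1, zv2, lam)
            + cca_ssg_loss(za1, zv1, lam))

def prior_frame_edges(num_frames, k=4):
    # Prior-frame positional encoding: each frame node connects to its
    # k preceding frames by sequence distance, not feature distance,
    # injecting temporal context through graph connectivity.
    src, dst = [], []
    for t in range(num_frames):
        for j in range(max(0, t - k), t):
            src.append(t)
            dst.append(j)
    return torch.tensor([src, dst])
```

The cross-modal term is what distinguishes the AV variant: the audio and visual embeddings of the same frame are treated like two views of one node, while the temporal edge list injects sequence order into the neighborhood structure.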
Related papers
- Canonical Correlation Guided Deep Neural Network [14.188285111418516]
We present a canonical-correlation-guided learning framework realized by deep neural networks (CCDNN).
In the proposed method, the optimization is not restricted to maximizing correlation; instead, canonical correlation is imposed as a constraint.
To reduce the redundancy induced by correlation, a redundancy filter is designed.
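One common way to realize correlation as a constraint is to add it as a penalty on the task objective. The sketch below takes that route, with a mean per-dimension correlation standing in for full CCA; the names and the weight `mu` are assumptions, not the paper's formulation.

```python
import torch

def correlation_penalty(z1, z2, eps=1e-8):
    # Mean per-dimension Pearson correlation between paired (N, D)
    # embeddings; (1 - corr) is small when they are highly correlated.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    return 1.0 - (z1 * z2).mean(0).mean()

def constrained_loss(task_loss, z1, z2, mu=0.1):
    # Correlation enforced as a soft constraint on the task objective,
    # rather than being the sole quantity maximized.
    return task_loss + mu * correlation_penalty(z1, z2)
```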
arXiv Detail & Related papers (2024-09-28T16:08:44Z)
- Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation [84.45144851024257]
CoGCL aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes.
We introduce a multi-level vector quantizer in an end-to-end manner to quantize user and item representations into discrete codes.
For neighborhood structure, we propose virtual neighbor augmentation by treating discrete codes as virtual neighbors.
Regarding semantic relevance, we identify similar users/items based on shared discrete codes and interaction targets to generate the semantically relevant view.
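A minimal sketch of the discrete-code mechanism, assuming a single-level nearest-codebook quantizer (the paper's quantizer is multi-level and trained end-to-end); the helper names are hypothetical.

```python
import torch

def quantize(z, codebook):
    # Map each representation (row of z, shape N x D) to the index of
    # its nearest code in the codebook (shape K x D).
    return torch.cdist(z, codebook).argmin(dim=1)

def virtual_neighbors(codes):
    # Users/items sharing a discrete code become virtual neighbors of
    # one another, densifying the graph used for contrastive views.
    groups = {}
    for idx, c in enumerate(codes.tolist()):
        groups.setdefault(c, []).append(idx)
    return groups
```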
arXiv Detail & Related papers (2024-09-09T14:04:17Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer model.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
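The "TC" stream's core step, mapping a 1-D signal to a 2-D scale-time tensor, can be reproduced with an off-the-shelf CWT. A sketch using PyWavelets follows; the Morlet wavelet and the scale grid are assumptions, not the paper's settings.

```python
import numpy as np
import pywt

signal = np.random.randn(256)   # toy 1-D behavioral feature signal
scales = np.arange(1, 65)       # assumed scale grid

# Continuous Wavelet Transform: 1-D signal -> 2-D (scale x time) tensor
coeffs, freqs = pywt.cwt(signal, scales, 'morl')
print(coeffs.shape)             # (64, 256): an image-like input for a CNN
```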
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data.
We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising.
Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z)
- Dynamic Semantic Compression for CNN Inference in Multi-access Edge Computing: A Graph Reinforcement Learning-based Autoencoder [82.8833476520429]
We propose a novel semantic compression method, the autoencoder-based CNN architecture (AECNN), for effective semantic extraction and compression in partial offloading.
In the semantic encoder, we introduce a feature compression module based on the channel attention mechanism in CNNs, to compress intermediate data by selecting the most informative features.
In the semantic decoder, we design a lightweight decoder to reconstruct the intermediate data through learning from the received compressed data to improve accuracy.
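The encoder's channel-attention compression can be sketched as squeeze-and-excitation-style scoring followed by keeping only the top-scoring channels; everything below (class name, reduction ratio, top-k selection) is an illustrative assumption rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttentionCompressor(nn.Module):
    # Score channels with a small gating network, then keep the `keep`
    # highest-scoring channels of the intermediate feature map.
    def __init__(self, channels, keep):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.keep = keep

    def forward(self, x):                        # x: (B, C, H, W)
        scores = self.fc(x.mean(dim=(2, 3)))     # (B, C) channel scores
        top = scores.topk(self.keep, dim=1).indices  # most informative
        idx = top.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, *x.shape[2:])
        return x.gather(1, idx), top             # compressed features + indices
```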
arXiv Detail & Related papers (2024-01-19T15:19:47Z)
- An Efficient Speech Separation Network Based on Recurrent Fusion Dilated Convolution and Channel Attention [0.2538209532048866]
We present an efficient speech separation neural network, ARFDCN, which combines dilated convolutions, multi-scale fusion (MSF), and channel attention.
Experimental results indicate that the model achieves a decent balance between performance and computational efficiency.
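A minimal sketch of the dilated-convolution multi-scale fusion ingredient, assuming parallel branches with growing dilation fused by a 1x1 convolution; the class name and hyperparameters are illustrative, not ARFDCN's implementation.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    # Parallel 1-D dilated convolutions with growing receptive fields,
    # fused by a 1x1 convolution (multi-scale fusion).
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=d, dilation=d) for d in dilations)
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)

    def forward(self, x):                 # x: (B, C, T)
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```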
arXiv Detail & Related papers (2023-06-09T13:30:27Z)
- Dynamic Kernels and Channel Attention with Multi-Layer Embedding Aggregation for Speaker Verification [28.833851817220616]
This paper proposes an approach to increase the model resolution capability using attention-based dynamic kernels in a convolutional neural network.
The proposed dynamic convolutional model achieved 1.62% EER and 0.18 minDCF on the VoxCeleb1 test set, a 17% relative improvement over ECAPA-TDNN.
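"Attention-based dynamic kernels" matches the well-known dynamic/conditional convolution pattern: several candidate kernels mixed by input-dependent attention weights. A minimal PyTorch sketch under that assumption (class name, kernel count, and pooling choice are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    # K candidate kernels mixed by input-dependent attention weights,
    # so the effective kernel adapts to each utterance.
    def __init__(self, in_ch, out_ch, ksize=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, ksize) * 0.02)
        self.attn = nn.Linear(in_ch, num_kernels)

    def forward(self, x):                              # x: (B, C, T)
        a = F.softmax(self.attn(x.mean(dim=2)), dim=1)     # (B, K)
        w = torch.einsum('bk,koif->boif', a, self.weight)  # per-sample kernel
        # Grouped-conv trick: fold the batch into groups so each sample
        # is convolved with its own mixed kernel in a single call.
        b, c, t = x.shape
        out = F.conv1d(x.reshape(1, b * c, t),
                       w.reshape(-1, c, w.size(-1)),
                       padding=w.size(-1) // 2, groups=b)
        return out.reshape(b, -1, t)
```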
arXiv Detail & Related papers (2022-11-03T17:13:28Z)
- Canonical Cortical Graph Neural Networks and its Application for Speech Enhancement in Future Audio-Visual Hearing Aids [0.726437825413781]
This paper proposes a more biologically plausible self-supervised machine learning approach that combines multimodal information using intra-layer modulations together with canonical correlation analysis (CCA).
The approach outperformed recent state-of-the-art results in both clean audio reconstruction and energy efficiency, the latter reflected in a reduced and smoother neuron firing-rate distribution.
arXiv Detail & Related papers (2022-06-06T15:20:07Z)
- Graph-based Algorithm Unfolding for Energy-aware Power Allocation in Wireless Networks [27.600081147252155]
We develop a novel graph-based algorithm-unfolding framework to maximize energy efficiency in wireless communication networks.
We show that the framework is permutation equivariant, a desirable property for models of wireless network data.
Results demonstrate its generalizability across different network topologies.
arXiv Detail & Related papers (2022-01-27T20:23:24Z)
- Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision [64.71260357476602]
Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames.
Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks.
We propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection.
arXiv Detail & Related papers (2021-12-06T23:45:58Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
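Window-restricted self-attention is the standard way to avoid the quadratic cost mentioned above. The sketch below partitions the spatial grid into non-overlapping windows and attends within each; the tensor layout and window size are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window=8):
    # Self-attention restricted to non-overlapping spatial windows,
    # avoiding the quadratic cost of global attention.
    # q, k, v: (B, H, W, D); assumes H and W are divisible by `window`.
    b, h, w, d = q.shape

    def to_windows(x):
        x = x.reshape(b, h // window, window, w // window, window, d)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, d)

    qw, kw, vw = map(to_windows, (q, k, v))
    attn = F.softmax(qw @ kw.transpose(1, 2) / d ** 0.5, dim=-1)
    out = attn @ vw                       # attention within each window
    out = out.reshape(b, h // window, w // window, window, window, d)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, d)
```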