LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging
- URL: http://arxiv.org/abs/2501.03464v2
- Date: Wed, 29 Jan 2025 12:22:49 GMT
- Title: LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging
- Authors: Shubhr Singh, Emmanouil Benetos, Huy Phan, Dan Stowell
- Abstract summary: This work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding.
Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks.
- Score: 23.464493621300242
- License:
- Abstract: Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.
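The abstract's core idea, combining local neighbourhood aggregation with higher-order information from Fuzzy C-Means soft cluster memberships, can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function names (`fuzzy_cmeans`, `lhgnn_layer`), the k-NN graph construction, and the simple averaging fusion are all assumptions made for illustration.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=50, seed=0):
    """Minimal Fuzzy C-Means: returns (centroids, memberships U of shape [n, c])."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                  # rows sum to 1
    for _ in range(iters):
        W = U ** m                                     # fuzzified memberships
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # distance of every node to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-9
        U = d ** (-2.0 / (m - 1.0))                    # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centroids, U

def lhgnn_layer(X, k=4, c=3):
    """One hypothetical LHGNN-style step: fuse a local k-NN aggregation
    with a higher-order, membership-weighted cluster-centroid aggregation."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                        # exclude self from neighbours
    knn = np.argsort(d, axis=1)[:, :k]                 # k nearest neighbours per node
    local = X[knn].mean(axis=1)                        # local neighbourhood term
    centroids, U = fuzzy_cmeans(X, c)
    higher = U @ centroids                             # soft-cluster (higher-order) term
    return 0.5 * (local + higher)                      # simple fusion of the two views
```

A real layer would of course add learnable weights and nonlinearities; the sketch only shows how soft cluster memberships let each node draw on cluster-level context beyond its pairwise neighbours.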
Related papers
- Dual-Frequency Filtering Self-aware Graph Neural Networks for Homophilic and Heterophilic Graphs [60.82508765185161]
We propose Dual-Frequency Filtering Self-aware Graph Neural Networks (DFGNN), which integrate low-pass and high-pass filters to extract smooth and detailed topological features.
It dynamically adjusts filtering ratios to accommodate both homophilic and heterophilic graphs.
arXiv Detail & Related papers (2024-11-18T04:57:05Z) - GraFPrint: A GNN-Based Approach for Audio Identification [11.71702857714935]
GraFPrint is an audio identification framework that leverages the structural learning capabilities of Graph Neural Networks (GNNs) to create robust audio fingerprints.
GraFPrint demonstrates superior performance on large-scale datasets at various levels of granularity, proving to be both lightweight and scalable.
arXiv Detail & Related papers (2024-10-14T18:20:09Z) - Noise-Resilient Unsupervised Graph Representation Learning via Multi-Hop Feature Quality Estimation [53.91958614666386]
We propose a novel method for unsupervised graph representation learning (UGRL) with graph neural networks (GNNs), based on Multi-hop feature Quality Estimation (MQE).
arXiv Detail & Related papers (2024-07-29T12:24:28Z) - Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data.
We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising.
Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z) - ATGNN: Audio Tagging Graph Neural Network [25.78859233831268]
ATGNN is a graph neural architecture that maps semantic relationships between learnable class embeddings and spectrogram regions.
We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset.
arXiv Detail & Related papers (2023-11-02T18:19:26Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - A Multimodal Canonical-Correlated Graph Neural Network for Energy-Efficient Speech Enhancement [4.395837214164745]
This paper proposes a novel multimodal self-supervised architecture for energy-efficient AV speech enhancement.
It integrates graph neural networks with canonical correlation analysis (CCA-GNN)
Experiments conducted on the benchmark CHiME-3 dataset show that our proposed prior-frame-based AV CCA-GNN achieves better feature learning in the temporal context.
arXiv Detail & Related papers (2022-02-09T15:47:07Z) - Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z) - Local Augmentation for Graph Neural Networks [78.48812244668017]
We introduce local augmentation, which enhances node features using their local subgraph structures.
Based on the local augmentation, we further design a novel framework: LA-GNN, which can apply to any GNN models in a plug-and-play manner.
arXiv Detail & Related papers (2021-09-08T18:10:08Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.