Attentional Graph Convolutional Network for Structure-aware Audio-Visual
Scene Classification
- URL: http://arxiv.org/abs/2301.00145v1
- Date: Sat, 31 Dec 2022 07:56:00 GMT
- Title: Attentional Graph Convolutional Network for Structure-aware Audio-Visual
Scene Classification
- Authors: Liguang Zhou, Yuhongze Zhou, Xiaonan Qi, Junjie Hu, Tin Lun Lam,
Yangsheng Xu
- Abstract summary: We present an end-to-end framework, the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation.
To effectively represent the salient regions and contextual information of the audio-visual inputs, the salient acoustic graph (SAG) and contextual acoustic graph (CAG), along with their visual counterparts (SVG and CVG), are constructed.
Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition.
- Score: 15.559827597608466
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-visual scene understanding is a challenging problem due to the
unstructured spatial-temporal relations in audio signals and the diverse
spatial layouts and texture patterns of objects in visual images. Recently,
many studies have focused on abstracting features with convolutional neural
networks, while the learning of explicitly semantically relevant frames of
sound signals and visual images has been overlooked. To this end, we present an
end-to-end framework, the attentional graph convolutional network (AGCN), for
structure-aware audio-visual scene representation. First, the spectrogram of
the sound and the input image are processed by a backbone network for feature
extraction. Then, to build multi-scale hierarchical information from the input
features, we use an attention fusion mechanism to aggregate features from
multiple layers of the backbone network. Notably, to represent the salient
regions and contextual information of the audio-visual inputs, the salient
acoustic graph (SAG), contextual acoustic graph (CAG), salient visual graph
(SVG), and contextual visual graph (CVG) are constructed for the audio-visual
scene representation. Finally, the constructed graphs pass through a graph
convolutional network for structure-aware audio-visual scene recognition.
Extensive experiments on audio, visual, and audio-visual scene recognition
datasets show that the AGCN achieves promising results. Visualizations of the
graphs on spectrograms and images show that the proposed CAG/SAG and CVG/SVG
focus on the salient and semantically relevant regions.
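To make the described pipeline concrete, here is a minimal sketch in PyTorch of the stages the abstract walks through: backbone features flattened into spatial nodes, an attention score per node, top-scoring nodes forming the salient graphs (SAG/SVG) and the remainder the contextual graphs (CAG/CVG), and a graph convolution with mean read-out feeding a classifier. Every name, the top-k selection rule, the fully connected adjacency, and the shared GCN weights are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names, top-k salient-node selection,
# and the fully connected adjacency are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: X' = ReLU(A_norm X W)."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin = nn.Linear(dim_in, dim_out)

    def forward(self, x, adj):
        # Row-normalise the adjacency so each node averages its neighbours.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.lin(adj @ x))


class AGCNSketch(nn.Module):
    def __init__(self, dim=256, num_classes=10, k_salient=8):
        super().__init__()
        self.k = k_salient
        self.attn = nn.Linear(dim, 1)        # attention score per spatial node
        self.gcn = SimpleGCNLayer(dim, dim)  # shared across graphs for brevity
        self.cls = nn.Linear(2 * dim, num_classes)

    def build_graphs(self, feats):
        # feats: (B, N, D), a backbone feature map flattened into N nodes
        # (assumed to be the output of the paper's attention fusion step).
        scores = self.attn(feats).squeeze(-1)        # (B, N)
        topk = scores.topk(self.k, dim=-1).indices   # salient node indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(1, topk, True)
        salient = feats * mask.unsqueeze(-1)         # SAG / SVG nodes
        context = feats * (~mask).unsqueeze(-1)      # CAG / CVG nodes
        # Fully connected adjacency as a placeholder; a learned or
        # similarity-based adjacency is more plausible in practice.
        adj = torch.ones(feats.size(0), feats.size(1), feats.size(1))
        return salient, context, adj

    def forward(self, audio_feats, visual_feats):
        pooled = []
        for feats in (audio_feats, visual_feats):
            sal, ctx, adj = self.build_graphs(feats)
            # Encode salient and contextual graphs, then mean read-out.
            h_sal = self.gcn(sal, adj).mean(dim=1)
            h_ctx = self.gcn(ctx, adj).mean(dim=1)
            pooled.append(h_sal + h_ctx)
        return self.cls(torch.cat(pooled, dim=-1))


model = AGCNSketch()
logits = model(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 10])
```

The mean read-out and the sum of salient/contextual encodings are simply the smallest choices that keep the sketch runnable; the paper's actual read-out and fusion may differ.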
Related papers
- Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment [18.08290178587821]
We propose a method for generating images of visual scenes from diverse in-the-wild sounds.
This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals.
arXiv Detail & Related papers (2024-12-09T05:04:50Z)
- Relevance-guided Audio Visual Fusion for Video Saliency Prediction [23.873134951154704]
We propose a novel relevance-guided audio-visual saliency prediction network, dubbed SP.
The Fusion module dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements.
The Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network's ability to represent objects at various scales.
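As a rough illustration of the relevance-guided retention idea, the toy function below keeps audio features in proportion to their cosine similarity with the visual features; the function name and the sigmoid gate are my assumptions, not the paper's module.

```python
# Hypothetical sketch of relevance-guided audio retention: audio features
# are down-weighted when semantically unrelated to the visual content.
import torch
import torch.nn.functional as F

def relevance_gated_fusion(audio, visual):
    # audio, visual: (B, D) global embeddings of each modality.
    relevance = F.cosine_similarity(audio, visual, dim=-1)  # (B,)
    gate = torch.sigmoid(relevance).unsqueeze(-1)           # retention in (0, 1)
    # Keep audio in proportion to its relevance, then fuse with vision.
    return visual + gate * audio

fused = relevance_gated_fusion(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape)  # torch.Size([4, 128])
```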
arXiv Detail & Related papers (2024-11-18T10:42:27Z)
- AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling [48.23652686272613]
This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation.
AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation.
Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
arXiv Detail & Related papers (2024-06-17T15:04:15Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
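One natural reading of "dynamically integrate" is cross-attention from audio frames to full-face visual tokens, as in the hedged sketch below; the dimensions, the residual connection, and all names are illustrative, not DualAVSE's actual design.

```python
# Minimal sketch: audio frames query facial tokens via cross-attention,
# so attention weights decide how much each visual cue contributes and
# speech-unrelated information is suppressed. Purely illustrative.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(2, 100, 256)  # (B, T, D) noisy-speech frames
face = torch.randn(2, 25, 256)    # (B, T_v, D) full-face visual tokens

visual_context, _ = attn(query=audio, key=face, value=face)
enhanced = audio + visual_context  # residual audio-visual integration
print(enhanced.shape)              # torch.Size([2, 100, 256])
```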
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Two Stream Scene Understanding on Graph Embedding [4.78180589767256]
The paper presents a novel two-stream network architecture for enhancing scene understanding in computer vision.
The graph feature stream network comprises a segmentation structure, scene graph generation, and a graph representation module.
Experiments conducted on the ADE20K dataset demonstrate the effectiveness of the proposed two-stream network in improving image classification accuracy.
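A toy version of the graph feature stream's final step, reducing a generated scene graph to a single embedding a classifier can consume, might look like the following; the module and its single message-passing step are assumptions, not the paper's architecture.

```python
# Toy graph-representation module: scene-graph nodes (object embeddings)
# and relation edges are reduced to one vector for the classifier head.
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.node_proj = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim) object embeddings from scene-graph generation.
        # adj: (N, N) relation adjacency from the generated scene graph.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        h = torch.relu(self.node_proj(adj @ node_feats))  # one message pass
        return h.mean(dim=0)  # graph embedding for the fusion/classifier head

readout = GraphReadout()
emb = readout(torch.randn(6, 128), torch.ones(6, 6))
print(emb.shape)  # torch.Size([128])
```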
arXiv Detail & Related papers (2023-11-12T05:57:56Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
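The self-supervised objective described above can be sketched as the classic "mix-and-separate" recipe: mix two videos' audio tracks, then ask the model to recover each source conditioned on that video's visual graph embedding. The stand-in separator below is purely illustrative of the training signal, not AVSGS's architecture.

```python
# Mix-and-separate sketch: the artificial mixture provides free supervision,
# so no human labels are needed. `separator` is a toy stand-in network.
import torch
import torch.nn as nn

separator = nn.Sequential(nn.Linear(256 + 128, 256), nn.ReLU(),
                          nn.Linear(256, 256))

audio_a, audio_b = torch.randn(4, 256), torch.randn(4, 256)  # source spectra
graph_a, graph_b = torch.randn(4, 128), torch.randn(4, 128)  # visual graph embeddings

mixture = audio_a + audio_b  # artificially mixed sounds
loss = 0.0
for target, graph in [(audio_a, graph_a), (audio_b, graph_b)]:
    # Recover each source conditioned on its video's visual graph.
    pred = separator(torch.cat([mixture, graph], dim=-1))
    loss = loss + nn.functional.mse_loss(pred, target)
loss.backward()  # gradients flow end-to-end without any labels
```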
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs with recurrent feedback connections that combine low- and high-level image information, and with graph pooling and vectorization operations that convert spatially uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
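To illustrate what "graph pooling and vectorization" of a spatially uniform feature map can mean, the sketch below groups pixels into a handful of cluster nodes with plain k-means; PSGNet learns this grouping, so the clustering here is only a stand-in showing the data flow.

```python
# Stand-in for graph pooling: a (H, W, D) feature map is reduced to a few
# object-centric node vectors. Real PSGNet learns the grouping end-to-end.
import torch

def pool_feature_map_to_nodes(fmap, num_nodes=4, iters=10):
    # fmap: (H, W, D) feature map -> (num_nodes, D) node features.
    pixels = fmap.reshape(-1, fmap.shape[-1])                # (H*W, D)
    centers = pixels[torch.randperm(len(pixels))[:num_nodes]].clone()
    for _ in range(iters):                                   # plain k-means
        assign = torch.cdist(pixels, centers).argmin(dim=1)
        for k in range(num_nodes):
            members = pixels[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(dim=0)
    return centers  # object-centric graph nodes

nodes = pool_feature_map_to_nodes(torch.randn(16, 16, 64))
print(nodes.shape)  # torch.Size([4, 64])
```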
arXiv Detail & Related papers (2020-06-22T16:10:26Z)