Multi-Resolution Audio-Visual Feature Fusion for Temporal Action
Localization
- URL: http://arxiv.org/abs/2310.03456v1
- Date: Thu, 5 Oct 2023 10:54:33 GMT
- Title: Multi-Resolution Audio-Visual Feature Fusion for Temporal Action
Localization
- Authors: Edward Fish, Jon Weinbren, Andrew Gilbert
- Abstract summary: This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions.
- Score: 8.633822294082943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Action Localization (TAL) aims to identify actions' start, end, and
class labels in untrimmed videos. While recent advancements using transformer
networks and Feature Pyramid Networks (FPN) have enhanced visual feature
recognition in TAL tasks, less progress has been made in the integration of
audio features into such frameworks. This paper introduces the Multi-Resolution
Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge
audio-visual data across different temporal resolutions. Central to our
approach is a hierarchical gated cross-attention mechanism, which discerningly
weighs the importance of audio information at diverse temporal scales. Such a
technique not only refines the precision of regression boundaries but also
bolsters classification confidence. Importantly, MRAV-FF is versatile, making
it compatible with existing FPN TAL architectures and offering a significant
enhancement in performance when audio data is available.
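The abstract gives only a high-level description of the fusion mechanism. As a rough, hypothetical sketch (module and parameter names are assumptions, not taken from the paper), a gated cross-attention fusion block at a single FPN temporal resolution could look like the following PyTorch module; one such block per pyramid level would let the model weigh audio differently at each temporal scale.

```python
# Minimal, hypothetical sketch of gated cross-attention fusion at one FPN level.
# This illustrates the general idea only, not the paper's implementation.
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    """Injects audio information into visual tokens at one temporal resolution.

    Visual tokens act as queries; audio tokens act as keys/values. A learned
    sigmoid gate decides, per time step and channel, how much attended audio
    to add back into the visual stream.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T_v, dim) tokens at this pyramid level
        # audio:  (B, T_a, dim) audio tokens, possibly at a different frame rate
        attended, _ = self.cross_attn(query=visual, key=audio, value=audio)
        g = self.gate(torch.cat([visual, attended], dim=-1))  # gate values in [0, 1]
        return self.norm(visual + g * attended)


# In an FPN-style TAL model, one block per pyramid level, e.g.:
# fused = [block(v, audio) for block, v in zip(fusion_blocks, visual_pyramid)]
```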
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representations separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
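The summary above only names the mechanism; as a loose, hypothetical illustration (function names and shapes are assumptions, not the authors' code), projecting segment features onto class label embeddings and comparing the audio and visual projections with a cosine-similarity-style loss could look roughly like this:

```python
# Hypothetical sketch of label-embedding projection and an audio-visual
# semantic similarity loss; not the LEAP authors' implementation.
import torch.nn.functional as F


def project_onto_labels(segment_feats, label_embeds):
    # segment_feats: (B, T, D) audio or visual segment features
    # label_embeds:  (C, D) one embedding per event class
    weights = (segment_feats @ label_embeds.t()).softmax(dim=-1)  # (B, T, C)
    return weights @ label_embeds                                  # (B, T, D)


def semantic_similarity_loss(audio_proj, visual_proj):
    # Encourage the two modalities' label-space projections to agree.
    return (1.0 - F.cosine_similarity(audio_proj, visual_proj, dim=-1)).mean()
```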
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
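For context on what operating on complex time-frequency bins involves, here is a generic, hypothetical sketch of STFT-domain mask-based separation (not RTFS-Net's actual architecture; `mask_net` is a placeholder for the separation network):

```python
# Generic STFT-domain separation sketch: transform the mixture, apply a
# network-predicted complex mask, and invert. Not RTFS-Net's actual model.
import torch


def separate(mixture, mask_net, n_fft=512, hop=128):
    # mixture: (B, samples) time-domain waveform
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)            # (B, F, T) complex bins
    mask = mask_net(spec)                             # complex mask, same shape
    return torch.istft(spec * mask, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])
```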
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Audio-Visual Glance Network for Efficient Video Recognition [17.95844876568496]
We propose the Audio-Visual Glance Network (AVGN) to efficiently process the spatio-temporally important parts of a video.
We use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame.
We incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN.
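As a rough, hypothetical illustration of the saliency-based frame selection idea (names and the top-k policy are assumptions, not the paper's code):

```python
# Hypothetical frame-selection sketch: score each frame with a saliency model
# and keep only the top-k most salient frames, preserving temporal order.
import torch


def select_salient_frames(frame_feats, saliency_model, k=16):
    # frame_feats: (B, T, D) per-frame audio-visual features
    scores = saliency_model(frame_feats).squeeze(-1)            # (B, T)
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values      # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
    return torch.gather(frame_feats, 1, idx)                    # (B, k, D)
```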
arXiv Detail & Related papers (2023-08-18T05:46:20Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
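As a loose, hypothetical sketch of the co-prediction idea behind such a self-supervised strategy (the prediction heads and loss choice are assumptions, not the authors' code), each modality's representation can be trained to predict the other's for the same sound source:

```python
# Hypothetical co-prediction objective: each branch predicts the other branch's
# representation of the same sound source. Not the AVPC authors' code.
import torch.nn.functional as F


def co_prediction_loss(audio_repr, visual_repr, a2v_head, v2a_head):
    # audio_repr, visual_repr: (B, D) representations of the same source
    loss_a2v = F.mse_loss(a2v_head(audio_repr), visual_repr.detach())
    loss_v2a = F.mse_loss(v2a_head(visual_repr), audio_repr.detach())
    return loss_a2v + loss_v2a
```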
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with 5.9% mean accuracy improvement.
arXiv Detail & Related papers (2022-10-11T00:15:45Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
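As a generic, hypothetical sketch of a co-attention step that updates both modalities in parallel (not the EFN implementation; the two modalities here are visual and language features, following the referring-segmentation setting):

```python
# Hypothetical co-attention sketch: both modalities attend to each other and
# are updated in the same step. Not the EFN authors' implementation.
import torch.nn as nn


class CoAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual, language):
        # visual: (B, N, dim) visual tokens; language: (B, L, dim) word tokens
        vis_upd, _ = self.lang_to_vis(query=visual, key=language, value=language)
        lang_upd, _ = self.vis_to_lang(query=language, key=visual, value=visual)
        # Parallel update: both outputs are computed from the pre-update inputs.
        return visual + vis_upd, language + lang_upd
```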
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.