Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
- URL: http://arxiv.org/abs/2311.16446v1
- Date: Tue, 28 Nov 2023 03:02:00 GMT
- Title: Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
- Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
- Abstract summary: We explore strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities.
We propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score.
- Score: 26.42447737005981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous one-stage action detection approaches have modelled temporal
dependencies using only the visual modality. In this paper, we explore
different strategies to incorporate the audio modality, using multi-scale
cross-attention to fuse the two modalities. We also demonstrate the correlation
between the distance from the timestep to the action centre and the accuracy of
the predicted boundaries. Thus, we propose a novel network head to estimate the
closeness of timesteps to the action centre, which we call the centricity
score. This leads to increased confidence for proposals that exhibit more
precise boundaries. Our method can be integrated with other one-stage
anchor-free architectures and we demonstrate this on three recent baselines on
the EPIC-Kitchens-100 action detection benchmark where we achieve
state-of-the-art performance. Detailed ablation studies showcase the benefits
of fusing audio and our proposed centricity scores. Code and models for our
proposed method are publicly available at
https://github.com/hanielwang/Audio-Visual-TAD.git
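The abstract's two core ideas, multi-scale cross-attention to fuse audio into visual features and a per-timestep centricity head whose score re-weights proposal confidence, can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation (see the repository above); the module sizes, layer choices, and the multiplicative confidence weighting are illustrative assumptions.

```python
# Minimal sketch of cross-attention audio-visual fusion and a centricity head.
# All dimensions, layer choices, and the score weighting are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse audio features into visual features at one temporal scale."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, C) queries; audio: (B, T_a, C) keys/values
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)  # residual connection

class CentricityHead(nn.Module):
    """Predict, per timestep, how close it lies to an action centre (0..1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) -> centricity scores: (B, T) in [0, 1]
        x = feats.transpose(1, 2)  # (B, C, T) for Conv1d
        return torch.sigmoid(self.net(x)).squeeze(1)

# Usage: weight per-timestep classification confidence by centricity,
# so proposals from timesteps near an action centre rank higher.
B, T, C = 2, 128, 256
visual, audio = torch.randn(B, T, C), torch.randn(B, T, C)
fused = CrossAttentionFusion(C)(visual, audio)
cls_scores = torch.rand(B, T)              # placeholder class confidences
centricity = CentricityHead(C)(fused)      # (B, T)
final_confidence = cls_scores * centricity # re-ranked proposal confidence
```

In the paper's framing, timesteps close to an action centre produce more precise boundaries, so weighting classification scores by centricity raises the confidence of proposals with better boundaries.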
Related papers
- Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies [11.671275975119089]
We propose fine-grained mechanisms for detecting subtle artifacts in both the spatial and temporal domains.
First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio.
Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set.
arXiv Detail & Related papers (2024-08-13T09:19:59Z)
- BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos [19.280799998526636]
Temporal sentence grounding aims to localize video moments relevant to a language description.
We propose a novel boundary-oriented moment formulation.
Experiments on three benchmarks validate the effectiveness of the proposed methods.
arXiv Detail & Related papers (2023-11-30T07:16:11Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes temporal modeling effort while achieving strong performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking random temporal proposals as input, it can accurately yield action proposals from an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization [7.577219401804674]
We propose simple but effective fusion-based approaches for TAL.
We experimentally show that our schemes consistently improve performance for state-of-the-art video-only TAL approaches.
arXiv Detail & Related papers (2021-06-27T00:49:02Z)
- Learning to Estimate Hidden Motions with Global Motion Aggregation [71.12650817490318]
Occlusions pose a significant challenge to optical flow algorithms that rely on local evidence.
We introduce a global motion aggregation module to find long-range dependencies between pixels in the first image.
We demonstrate that the optical flow estimates in the occluded regions can be significantly improved without damaging the performance in non-occluded regions.
arXiv Detail & Related papers (2021-04-06T10:32:03Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim to learn pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
- Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision [10.859792341257931]
We first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours.
We propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features.
Our method outperforms other state-of-the-art methods on our released dataset and other existing benchmarks.
arXiv Detail & Related papers (2020-07-09T10:29:31Z)