Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
- URL: http://arxiv.org/abs/2112.04771v1
- Date: Thu, 9 Dec 2021 09:00:05 GMT
- Title: Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
- Authors: Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, Limin Wang
- Abstract summary: Generic event boundary detection is an important yet challenging task in video understanding.
This paper presents an effective and end-to-end learnable framework (DDM-Net) to tackle the diversity and complicated semantics of event boundaries.
- Score: 35.16241630620967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic event boundary detection is an important yet challenging task in
video understanding, which aims at detecting the moments where humans naturally
perceive event boundaries. The main challenge of this task is perceiving the
various temporal variations of diverse event boundaries. To this end, this
paper presents an effective and end-to-end learnable framework (DDM-Net). To
tackle the diversity and complicated semantics of event boundaries, we make
three notable improvements. First, we construct a feature bank to store
multi-level features of space and time, prepared for difference calculation at
multiple scales. Second, to alleviate the inadequate temporal modeling of
previous methods, we present dense difference maps (DDM) to comprehensively
characterize the motion pattern. Finally, we exploit progressive attention on
multi-level DDM to jointly aggregate appearance and motion cues. As a result,
DDM-Net achieves significant boosts of 14% and 8% on the Kinetics-GEBD and
TAPOS benchmarks, respectively, and outperforms the top-1 winner solution of
the LOVEU Challenge@CVPR 2021 without bells and whistles. This
state-of-the-art result demonstrates the effectiveness of richer motion
representation and more sophisticated aggregation in handling the diversity of
generic event boundary detection. Our code will be made available soon.
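To make the three components above concrete, here is a minimal PyTorch sketch of the core idea: building a dense difference map (pairwise feature differences across all frame pairs) per feature level, then fusing the levels with a simple learned attention before scoring boundaries. All names, shapes, and the scalar per-level softmax weighting are illustrative assumptions for this sketch; the paper's progressive attention is more elaborate, and this is not the authors' released implementation.

import torch
import torch.nn as nn

class DenseDifferenceSketch(nn.Module):
    """Hedged sketch of the DDM-Net idea (not the official code).

    Assumes every feature level shares the same channel width C and
    window length T, which keeps the fusion a simple weighted sum.
    """

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One attention logit per level; a stand-in for the paper's
        # progressive attention over multi-level DDMs.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        self.head = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, 1),  # per-frame boundary score
        )

    @staticmethod
    def dense_difference_map(feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C) -> DDM of shape (T, T, C); entry (i, j) = f_i - f_j.
        return feats.unsqueeze(1) - feats.unsqueeze(0)

    def forward(self, multi_level_feats):
        # multi_level_feats: list of (T, C) tensors, one per level.
        ddms = [self.dense_difference_map(f) for f in multi_level_feats]
        weights = torch.softmax(self.level_logits, dim=0)
        fused = sum(w * m for w, m in zip(weights, ddms))  # (T, T, C)
        motion = fused.mean(dim=1)  # pool comparisons -> (T, C) motion code
        return self.head(motion).squeeze(-1)  # (T,) boundary scores

# Usage: 3 feature levels, a 16-frame window, 64-dim features.
feats = [torch.randn(16, 64) for _ in range(3)]
scores = DenseDifferenceSketch(channels=64, num_levels=3)(feats)
print(scores.shape)  # torch.Size([16])

Mean-pooling the comparison axis is the simplest way to turn a (T, T, C) map into per-frame evidence; the appeal of keeping the full map first is that every temporal scale of change is represented before any pooling happens.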
Related papers
- EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning [15.727918674166714]
Event cameras have demonstrated significant success across a wide range of areas due to their low latency and high dynamic range.
However, the community faces challenges such as data deficiency and limited diversity, often resulting in overfitting and inadequate feature learning.
This work introduces a systematic augmentation scheme named EventAug to enrich spatio-temporal diversity.
arXiv Detail & Related papers (2024-09-18T09:01:34Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on temporal action detection benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- Fine-grained Dynamic Network for Generic Event Boundary Detection [9.17191007695011]
We propose DyBDet, a novel dynamic pipeline for generic event boundary detection.
By introducing a multi-exit network architecture, DyBDet automatically learns how to allocate processing across different video snippets.
Experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks.
arXiv Detail & Related papers (2024-07-05T06:02:46Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We formulate the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two or more spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- One for All: An End-to-End Compact Solution for Hand Gesture Recognition [8.321276216978637]
This paper proposes an end-to-end compact CNN framework: a fine-grained feature attentive network for hand gesture recognition (Fit-Hand).
The pipeline of the proposed architecture consists of two main units: a FineFeat module and a dilated convolutional (Conv) layer.
The effectiveness of Fit-Hand is evaluated using subject-dependent (SD) and subject-independent (SI) validation setups over seven benchmark datasets.
arXiv Detail & Related papers (2021-05-15T05:10:47Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
Thanks to the effectiveness of its three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z)
- DFNet: Discriminative feature extraction and integration network for salient object detection [6.959742268104327]
We focus on two challenges in saliency detection using Convolutional Neural Networks.
First, since salient objects appear in various sizes, single-scale convolutions cannot capture objects at the right scale.
Second, using multi-level features helps the model exploit both local and global context.
arXiv Detail & Related papers (2020-04-03T13:56:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.