Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization
- URL: http://arxiv.org/abs/2305.04186v3
- Date: Mon, 25 Dec 2023 07:10:21 GMT
- Title: Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization
- Authors: Xijun Wang, Aggelos K. Katsaggelos
- Abstract summary: Weakly-supervised temporal action localization aims to identify and localize the action instances in untrimmed videos with only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
- Score: 14.43055117008746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization aims to identify and localize
the action instances in the untrimmed videos with only video-level action
labels. When humans watch videos, we can adapt our abstract-level knowledge
about actions in different video scenarios and detect whether some actions are
occurring. In this paper, we mimic this human ability and bring a new perspective
for locating and identifying multiple actions in a video. We propose a network
named VQK-Net with a video-specific query-key attention modeling that learns a
unique query for each action category of each input video. The learned queries
not only carry abstract-level knowledge of the actions but also adapt this
knowledge to the target video scenario; they are then used to detect the
presence of the corresponding action along the
temporal dimension. To better learn these action category queries, we exploit
not only the features of the current input video but also the correlation
between different videos through a novel video-specific action category query
learner that works with a query similarity loss. Finally, we conduct extensive
experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and
ActivityNet1.3) and achieve state-of-the-art performance.
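Below is a minimal, hypothetical sketch of the query-key idea described in the abstract: per-category queries, adapted to the current video, are matched against temporal snippet features (the "keys") to score the presence of each action over time, together with a simple query similarity term comparing queries of the same category learned from different videos. All module and variable names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (assumed design, not the official VQK-Net code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryKeyDetector(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # Abstract-level (video-agnostic) query embedding for each action category.
        self.base_queries = nn.Parameter(torch.randn(num_classes, feat_dim))
        # Adapts the abstract queries to the current video using its mean feature.
        self.adapt = nn.Linear(2 * feat_dim, feat_dim)
        self.key_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, snippets: torch.Tensor):
        # snippets: (T, feat_dim) temporal snippet features of one untrimmed video.
        video_ctx = snippets.mean(dim=0, keepdim=True).expand_as(self.base_queries)
        queries = self.adapt(torch.cat([self.base_queries, video_ctx], dim=-1))  # (C, D)
        keys = self.key_proj(snippets)                                           # (T, D)
        # Query-key matching along the temporal dimension -> class activations (C, T).
        cas = queries @ keys.t() / keys.shape[-1] ** 0.5
        return queries, cas

def query_similarity_loss(q_a: torch.Tensor, q_b: torch.Tensor) -> torch.Tensor:
    """Pull together per-category queries learned from two different videos."""
    return (1.0 - F.cosine_similarity(q_a, q_b, dim=-1)).mean()

# Example: score two untrimmed videos and tie their category queries together.
model = QueryKeyDetector(feat_dim=2048, num_classes=20)
q1, cas1 = model(torch.randn(120, 2048))  # 120 snippets
q2, cas2 = model(torch.randn(95, 2048))
loss_sim = query_similarity_loss(q1, q2)
```

In a full weakly-supervised pipeline, the class activations would typically be pooled (e.g., top-k over time) into video-level predictions trained against the video-level labels, with the similarity term encouraging consistent category queries across videos.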
Related papers
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments taking only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Few-Shot Action Localization without Knowing Boundaries [9.959844922120523]
We show that it is possible to learn to localize actions in untrimmed videos when only one/few trimmed examples of the target action are available at test time.
We propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos (a rough sketch of such a matrix follows this entry).
Our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.
arXiv Detail & Related papers (2021-06-08T07:32:43Z)
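As a rough illustration of the Temporal Similarity Matrix idea in the entry above (an assumed form, not that paper's exact definition), such a matrix can be computed as the pairwise cosine similarity between the snippet features of a trimmed support video and an untrimmed query video; high-similarity regions along the query axis then hint at where the action occurs.

```python
# Illustrative sketch of a temporal similarity matrix (assumed formulation).
import torch
import torch.nn.functional as F

def temporal_similarity_matrix(support: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between support (Ts, D) and query (Tq, D) snippet features.

    Returns a (Ts, Tq) matrix; columns with high similarity suggest query
    snippets that resemble the trimmed support action.
    """
    s = F.normalize(support, dim=-1)
    q = F.normalize(query, dim=-1)
    return s @ q.t()
```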
- Learning to Localize Actions from Moments [153.54638582696128]
We introduce a new transfer learning design to learn action localization for a large set of action categories.
We present Action Herald Networks (AherNet) that integrate such a design into a one-stage action localization framework.
arXiv Detail & Related papers (2020-08-31T16:03:47Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- Localizing the Common Action Among a Few Videos [51.09824165433561]
This paper strives to localize the temporal extent of an action in a long untrimmed video.
We introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments.
arXiv Detail & Related papers (2020-08-13T11:31:23Z)
- Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)