Few-Shot Temporal Action Localization with Query Adaptive Transformer
- URL: http://arxiv.org/abs/2110.10552v1
- Date: Wed, 20 Oct 2021 13:18:01 GMT
- Title: Few-Shot Temporal Action Localization with Query Adaptive Transformer
- Authors: Sauradip Nag, Xiatian Zhu and Tao Xiang
- Abstract summary: Existing temporal action localization (TAL) works rely on a large number of training videos with exhaustive segment-level annotation.
Few-shot TAL aims to adapt a model to a new class represented by as few as a single video.
- Score: 105.84328176530303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing temporal action localization (TAL) works rely on a large number of
training videos with exhaustive segment-level annotation, preventing them from
scaling to new classes. As a solution to this problem, few-shot TAL (FS-TAL)
aims to adapt a model to a new class represented by as few as a single video.
Existing FS-TAL methods assume trimmed training videos for new classes. However,
this setting is not only unnatural, since actions are typically captured in
untrimmed videos, but also ignores background video segments that contain vital
contextual cues for foreground action segmentation. In this work, we first
introduce a new FS-TAL setting that uses untrimmed training videos. Further, a novel
FS-TAL model is proposed which maximizes the knowledge transfer from training
classes whilst enabling the model to be dynamically adapted to both the new
class and each video of that class simultaneously. This is achieved by
introducing a query adaptive Transformer in the model. Extensive experiments on
two action localization benchmarks demonstrate that our method significantly
outperforms all state-of-the-art alternatives in both single-domain and
cross-domain scenarios. The source code is available at
https://github.com/sauradip/fewshotQAT
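A rough illustration of the core idea (not the authors' implementation): the untrimmed query video's snippet features cross-attend to the few-shot support video's features, so the representation is adapted to both the new class and the specific query video before a per-snippet head scores foreground action. The module names, tensor shapes, and the foreground head below are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class QueryAdaptiveTransformer(nn.Module):
    """Minimal sketch of a query adaptive Transformer block for few-shot TAL."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Query-video snippets attend to support-video snippets (assumed design).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.fg_head = nn.Linear(dim, 1)  # hypothetical per-snippet foreground scorer

    def forward(self, query_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, T_q, dim) snippet features of the untrimmed query video
        # support_feats: (B, T_s, dim) snippet features of the few-shot support video(s)
        attended, _ = self.cross_attn(query_feats, support_feats, support_feats)
        x = self.norm1(query_feats + attended)
        x = self.norm2(x + self.ffn(x))
        return self.fg_head(x).squeeze(-1)  # (B, T_q) foreground logits per query snippet


# Toy usage: a 100-snippet query video adapted with a 40-snippet one-shot support video.
qat = QueryAdaptiveTransformer()
scores = qat(torch.randn(1, 100, 256), torch.randn(1, 40, 256))
print(scores.shape)  # torch.Size([1, 100])
```

Cross-attention is used here simply because it is the most direct way to condition one sequence on another; the actual model may adapt features or parameters differently.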
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each containing multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of two other videos.
The pretext task is to align the features of the pasted pseudo action regions in the two synthetic videos and maximize the agreement between them (a minimal sketch of this recipe appears after this related-papers list).
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Learning to Localize Actions from Moments [153.54638582696128]
We introduce a novel transfer learning design to learn action localization for a large set of action categories.
We present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework.
arXiv Detail & Related papers (2020-08-31T16:03:47Z)
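As referenced in the UP-TAL entry above, the PAL pretext task pastes a pseudo action region from one video into two other videos and maximizes agreement between the two pasted copies' features. The sketch below is a hypothetical illustration of that recipe, operating on pre-extracted snippet features with an assumed encoder, region length, and cosine agreement loss; it is not the UP-TAL authors' code.

```python
import torch
import torch.nn.functional as F


def paste_region(background: torch.Tensor, region: torch.Tensor, start: int) -> torch.Tensor:
    """Overwrite background snippet features [start, start + len(region)) with the pseudo action."""
    out = background.clone()
    out[start : start + region.shape[0]] = region
    return out


def pal_agreement_loss(encoder, source, bg_a, bg_b, region_len: int = 16) -> torch.Tensor:
    # source, bg_a, bg_b: (T, dim) snippet features of three untrimmed videos (assumed shapes)
    r_start = torch.randint(0, source.shape[0] - region_len, (1,)).item()
    region = source[r_start : r_start + region_len]  # pseudo action region

    # Paste the same pseudo action at different temporal positions of the two other videos.
    pos_a = torch.randint(0, bg_a.shape[0] - region_len, (1,)).item()
    pos_b = torch.randint(0, bg_b.shape[0] - region_len, (1,)).item()
    vid_a = paste_region(bg_a, region, pos_a)
    vid_b = paste_region(bg_b, region, pos_b)

    # Encode both synthetic videos and pull the pasted regions' features together.
    feat_a = encoder(vid_a)[pos_a : pos_a + region_len].mean(dim=0)
    feat_b = encoder(vid_b)[pos_b : pos_b + region_len].mean(dim=0)
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=0)  # agreement loss


# Toy usage with an identity "encoder" and random snippet features.
loss = pal_agreement_loss(lambda x: x, torch.randn(80, 256), torch.randn(120, 256), torch.randn(100, 256))
print(loss.item())
```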
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.