Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream
- URL: http://arxiv.org/abs/2208.12673v1
- Date: Thu, 25 Aug 2022 13:41:03 GMT
- Title: Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream
- Authors: Yue Tang, Yawen Wu, Peipei Zhou, and Jingtong Hu
- Abstract summary: We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from a long, on-device video stream.
- Score: 5.215681853828831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting actions in videos has been widely applied in on-device
applications. Practical on-device videos are typically untrimmed, containing
both action and background, so a model should both recognize the class of an
action and localize the temporal position where it happens. This task is
called temporal action localization (TAL), and TAL models are usually trained
on the cloud, where multiple untrimmed videos are collected and labeled. It is
desirable for a TAL model to continuously and locally learn from new data,
which can directly improve action detection precision while protecting
customers' privacy. However, training a TAL model is non-trivial, since it
requires a tremendous number of video samples with temporal annotations, and
annotating videos frame by frame is exorbitantly time-consuming and expensive.
Although weakly-supervised TAL (W-TAL) has been proposed to learn from
untrimmed videos with only video-level labels, such an approach is also not
suitable for on-device learning scenarios. In practical on-device learning
applications, data arrive as a stream, and dividing such a long video stream
into multiple video segments requires substantial human effort, which hinders
applying TAL to realistic on-device learning applications. To enable W-TAL
models to learn from a long, untrimmed streaming video, we propose an
efficient video learning approach that can directly adapt to new environments.
We first propose a self-adaptive video dividing approach with a contrast
score-based segment merging approach to convert the video stream into multiple
segments. Then, we explore different sampling strategies on the TAL tasks to
request as few labels as possible. To the best of our knowledge, this is the
first attempt to learn directly from an on-device, long video stream.
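The abstract does not spell out how the contrast score or the merging rule is defined, so the following Python sketch only illustrates the general idea of contrast-score-based segment merging over a streamed sequence of frame features. The contrast score used here (1 minus cosine similarity of segment-mean features), the fixed initial snippet length, and the merge threshold are assumptions made for this example, not the paper's actual formulation.

```python
# Illustrative sketch only (not the paper's code): contrast-score-based merging
# of a streamed sequence of per-frame features into candidate segments.
# The contrast score (1 - cosine similarity of segment-mean features), the
# initial snippet length, and the merge threshold are assumed for this example.
import numpy as np

def cosine(u, v, eps=1e-8):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def merge_stream_into_segments(frame_feats, init_len=16, merge_threshold=0.2):
    """frame_feats: (T, D) array of per-frame features from the video stream."""
    if len(frame_feats) == 0:
        return []
    # 1) Cut the stream into fixed-length initial snippets.
    snippets = [frame_feats[i:i + init_len]
                for i in range(0, len(frame_feats), init_len)]
    # 2) Greedily merge adjacent snippets whose contrast score is low,
    #    i.e. whose mean features look alike (likely the same scene/action).
    segments = [snippets[0]]
    for snip in snippets[1:]:
        contrast = 1.0 - cosine(segments[-1].mean(axis=0), snip.mean(axis=0))
        if contrast < merge_threshold:
            segments[-1] = np.concatenate([segments[-1], snip], axis=0)
        else:
            segments.append(snip)
    return segments  # variable-length segments to be labeled and used for W-TAL
```

Segments produced this way could then be ranked by a sampling strategy (for example by uncertainty or diversity) so that only a small number of them are sent out for video-level labeling, in line with the label-efficiency goal stated in the abstract.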
Related papers
- Semi-supervised Active Learning for Video Action Detection [8.110693267550346]
We develop a novel semi-supervised active learning approach which utilizes both labeled and unlabeled data.
We evaluate the proposed approach on three different benchmark datasets: UCF-101-24, JHMDB-21, and YouTube-VOS.
arXiv Detail & Related papers (2023-12-12T11:13:17Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% AP^video_50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of AP^video.
arXiv Detail & Related papers (2023-08-28T17:10:12Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of the pasted pseudo-action regions from the two synthetic videos and maximize the agreement between them.
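The summary above describes the paste operation only at a high level; the sketch below illustrates, in Python, how a pseudo-action region sampled from one clip-feature sequence might be pasted into two other videos at independent positions. The function name, the feature-array representation, and the fixed region length are assumptions made for illustration, not the PAL authors' implementation.

```python
# Illustrative sketch only (not the PAL authors' code): sample a pseudo-action
# region from a source video's clip features and paste it into two other videos
# at independent positions; the pretext loss would then align the features of
# the two pasted regions. Shapes and the fixed region length are assumptions.
import numpy as np

def paste_pseudo_action(source, target_a, target_b, region_len=4, rng=None):
    """source, target_a, target_b: (T, D) clip-level feature sequences."""
    rng = rng or np.random.default_rng()
    # Sample a pseudo-action region (several consecutive clips) from the source.
    start = int(rng.integers(0, len(source) - region_len + 1))
    region = source[start:start + region_len]
    # Paste the same region over a span of each target at independent positions.
    pos_a = int(rng.integers(0, len(target_a) - region_len + 1))
    pos_b = int(rng.integers(0, len(target_b) - region_len + 1))
    synthetic_a = np.concatenate(
        [target_a[:pos_a], region, target_a[pos_a + region_len:]], axis=0)
    synthetic_b = np.concatenate(
        [target_b[:pos_b], region, target_b[pos_b + region_len:]], axis=0)
    # Return the two synthetic videos and the pasted spans whose encoded
    # features the pretext task would be trained to agree on.
    return (synthetic_a, (pos_a, pos_a + region_len),
            synthetic_b, (pos_b, pos_b + region_len))
```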
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Few-Shot Temporal Action Localization with Query Adaptive Transformer [105.84328176530303]
Existing TAL works rely on a large number of training videos with exhaustive segment-level annotation.
Few-shot TAL aims to adapt a model to a new class represented by as few as a single video.
arXiv Detail & Related papers (2021-10-20T13:18:01Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Few-Shot Action Localization without Knowing Boundaries [9.959844922120523]
We show that it is possible to learn to localize actions in untrimmed videos when only one/few trimmed examples of the target action are available at test time.
We propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos.
Our method achieves performance comparable to or better than state-of-the-art fully-supervised few-shot learning methods.
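A Temporal Similarity Matrix is, at its core, a grid of pairwise similarity scores between the clip features of two videos; the snippet below computes such a matrix with plain cosine similarity as a generic illustration of the data structure. The paper learns to estimate these matrices with a network, so this is only a hedged sketch, not the authors' method.

```python
# Generic illustration of a temporal similarity matrix (TSM): entry (i, j)
# scores how similar clip i of a trimmed query is to clip j of an untrimmed
# video. The paper learns to *estimate* such matrices; here we simply compute
# cosine similarities from given features for illustration.
import numpy as np

def temporal_similarity_matrix(query_feats, untrimmed_feats, eps=1e-8):
    """query_feats: (Tq, D); untrimmed_feats: (Tu, D) -> TSM of shape (Tq, Tu)."""
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    u = untrimmed_feats / (np.linalg.norm(untrimmed_feats, axis=1, keepdims=True) + eps)
    # High-scoring diagonal stripes indicate where the query action likely occurs.
    return q @ u.T
```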
arXiv Detail & Related papers (2021-06-08T07:32:43Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
The canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)