Skimming and Scanning for Untrimmed Video Action Recognition
- URL: http://arxiv.org/abs/2104.10492v1
- Date: Wed, 21 Apr 2021 12:23:44 GMT
- Title: Skimming and Scanning for Untrimmed Video Action Recognition
- Authors: Yunyan Hong, Ailing Zeng, Min Li, Cewu Lu, Li Jiang, Qiang Xu
- Abstract summary: Untrimmed videos have redundant and diverse clips containing contextual information.
We propose a simple yet effective clip-level solution based on skim-scan techniques.
Our solution surpasses the state-of-the-art performance in terms of both accuracy and efficiency.
- Score: 44.70501912319826
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video action recognition (VAR) is a primary task of video understanding, and
untrimmed videos are more common in real-life scenes. Untrimmed videos have
redundant and diverse clips containing contextual information, so sampling
dense clips is essential. Recently, some works have attempted to train a generic
model to select the N most representative clips. However, it is difficult to
model the complex relations among intra-class clips and inter-class videos with
a single model and a fixed number of selected clips, and the entanglement of
multiple relations is also hard to interpret. Thus, instead of "only look once",
we argue that a "divide and conquer" strategy is more suitable for untrimmed
VAR. Inspired by the speed-reading mechanism, we propose a simple yet effective
clip-level solution based on skim-scan techniques. Specifically, the proposed
Skim-Scan framework first skims the entire video and drops uninformative and
misleading clips. It then gradually scans the remaining clips for diverse
features, dropping redundant clips while still covering the essential content.
These strategies adaptively select the necessary clips according to the
difficulty of each video. To trade off computational complexity against
performance, we observe that lightweight and heavy networks exhibit similar
statistical expressions, which supports exploring a combination of the two.
Comprehensive experiments on the ActivityNet and mini-FCVID datasets
demonstrate that our solution surpasses state-of-the-art performance in both
accuracy and efficiency.
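To make the skim-then-scan idea concrete, here is a minimal illustrative sketch in Python. It is not the paper's actual algorithm: it assumes each clip already has a feature vector and a classifier confidence, uses a hypothetical confidence threshold for the skim step and a cosine-similarity test for the scan step, and the function name skim_scan_select and both thresholds are invented for this sketch.

```python
import numpy as np

def skim_scan_select(clip_features, clip_confidences,
                     skim_threshold=0.2, diversity_threshold=0.9):
    """Illustrative skim-then-scan clip selection (not the paper's exact rules).

    clip_features:    (N, D) array of per-clip feature vectors.
    clip_confidences: (N,) array of per-clip classifier confidences.
    Returns the indices of the adaptively selected clips.
    """
    # Skim: drop uninformative / low-confidence clips in one pass.
    kept = [i for i, c in enumerate(clip_confidences) if c >= skim_threshold]

    # Scan: greedily keep clips whose features differ enough from those
    # already selected, dropping redundant clips while covering diverse content.
    selected = []
    for i in kept:
        f_i = clip_features[i] / (np.linalg.norm(clip_features[i]) + 1e-8)
        redundant = any(
            float(f_i @ (clip_features[j] / (np.linalg.norm(clip_features[j]) + 1e-8)))
            > diversity_threshold
            for j in selected
        )
        if not redundant:
            selected.append(i)
    return selected


# Example: 8 clips with random features and confidences.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
confs = rng.uniform(size=8)
print(skim_scan_select(feats, confs))
```

Because the number of surviving clips depends on the thresholds and on how similar a video's clips are to one another, the selection size varies per video, mirroring the adaptive behavior described in the abstract.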
Related papers
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z)
- Semantic Video Moments Retrieval at Scale: A New Task and a Baseline [6.997674465889922]
Semantic Video Moments Retrieval at scale (SVMR) aims at finding relevant videos and re-localizing the target clips within them.
To address the challenges of this task, we propose a two-stage baseline solution: candidate video retrieval followed by a novel attention-based query-reference semantic alignment framework.
arXiv Detail & Related papers (2022-10-15T22:46:22Z)
- Enabling Weakly-Supervised Temporal Action Localization from On-Device
Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
- ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
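Several entries above contrast dense or uniform fixed-length clip sampling (the "common approach" noted in the Kernel Temporal Segmentation entry) with sparser alternatives such as ClipBERT's sparse sampling. The snippet below is a small illustrative comparison under assumed frame-index conventions; the helper names and parameters are made up for this sketch and are not taken from any of the listed papers.

```python
import random

def uniform_clips(num_frames, clip_len=16, stride=16):
    """Fixed-length clips taken uniformly over the whole video (dense sampling)."""
    return [list(range(s, s + clip_len))
            for s in range(0, num_frames - clip_len + 1, stride)]

def sparse_clips(num_frames, clip_len=16, num_clips=4, seed=0):
    """Sparsely sample only a few short clips per video (sparse-sampling idea)."""
    rng = random.Random(seed)
    starts = sorted(rng.sample(range(num_frames - clip_len + 1), num_clips))
    return [list(range(s, s + clip_len)) for s in starts]

# A 10-second video at 30 fps: dense uniform sampling yields many clips,
# sparse sampling keeps only a handful.
print(len(uniform_clips(300)))   # 18 clips
print(len(sparse_clips(300)))    # 4 clips
```

The point of the comparison is only that sparse sampling touches far fewer frames per video, which is what makes end-to-end training affordable in the ClipBERT setting described above.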
This list is automatically generated from the titles and abstracts of the papers on this site.