Colar: Effective and Efficient Online Action Detection by Consulting Exemplars
- URL: http://arxiv.org/abs/2203.01057v1
- Date: Wed, 2 Mar 2022 12:13:08 GMT
- Title: Colar: Effective and Efficient Online Action Detection by Consulting Exemplars
- Authors: Le Yang, Junwei Han, Dingwen Zhang
- Abstract summary: We develop an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates exemplary features based on the similarity weights.
Owing to the complementary category-level modeling, our method employs a lightweight architecture yet achieves state-of-the-art performance on three benchmarks.
- Score: 102.28515426925621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online action detection has attracted increasing research interest in recent years. Current works model historical dependencies and anticipate the future to perceive the action evolution within a video segment and improve detection accuracy. However, the existing paradigm ignores category-level modeling and does not pay sufficient attention to efficiency. For a given category, its representative frames exhibit varied characteristics, so category-level modeling can provide guidance complementary to temporal-dependency modeling. In this paper, we develop an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates the exemplary features based on the similarity weights. The mechanism is also efficient, as both the similarity measurement and the feature aggregation require limited computation. Based on the exemplar-consultation mechanism, long-term dependencies can be captured by regarding historical frames as exemplars, and category-level modeling can be achieved by regarding representative frames of a category as exemplars. Owing to the complementary category-level modeling, our method employs a lightweight architecture yet achieves state-of-the-art performance on three benchmarks. In addition, when a spatio-temporal network is used to process video frames, our method takes 9.8 seconds to handle a one-minute video while achieving comparable performance.
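The exemplar-consultation mechanism described in the abstract is, at its core, an attention-style weighted aggregation: similarities between the current frame and a set of exemplars become the weights of a feature sum. The snippet below is a minimal sketch of that idea, not the authors' implementation; the function name `consult_exemplars`, the cosine-similarity measure, and the softmax temperature are illustrative assumptions.

```python
# Minimal sketch of exemplar consultation (illustrative, not the paper's code):
# a frame feature attends to a bank of exemplar features via similarity weights.
import torch
import torch.nn.functional as F

def consult_exemplars(frame_feat, exemplar_feats, temperature=1.0):
    """frame_feat: (D,) current-frame feature.
    exemplar_feats: (K, D) features of K exemplars -- historical frames,
    or representative frames of a category.
    Returns a (D,) feature aggregated from the exemplars."""
    # Similarity between the frame and each exemplar (cosine is an assumption).
    sims = F.cosine_similarity(frame_feat.unsqueeze(0), exemplar_feats, dim=1)  # (K,)
    # Similarities become aggregation weights.
    weights = F.softmax(sims / temperature, dim=0)  # (K,)
    # Weighted sum of exemplar features.
    return weights @ exemplar_feats  # (D,)

# Usage: consult 10 exemplars of dimension 256.
out = consult_exemplars(torch.randn(256), torch.randn(10, 256))
print(out.shape)  # torch.Size([256])
```

Both steps reduce to a handful of matrix operations whose cost grows linearly with the number of exemplars, which is consistent with the abstract's efficiency claim.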
Related papers
- Towards Unifying Feature Interaction Models for Click-Through Rate Prediction [19.149554121852724]
We propose a general framework called IPA to unify existing models.
We demonstrate that most existing models can be categorized within our framework by making specific choices for its three components.
We introduce a novel model that achieves competitive results compared to state-of-the-art CTR models.
arXiv Detail & Related papers (2024-11-19T12:04:02Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a popular research topic, driven by the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select salient frames without awareness of class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement (a minimal sketch follows this entry).
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
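As a loose illustration of the class-specific saliency idea in the TSQ entry above: one way to realize it is a learnable query vector per class, with a frame's saliency for a class given by its similarity to that query. Everything here (names, dot-product scoring, top-k selection) is an assumption, not the TSQ paper's actual design.

```python
# Hypothetical sketch: per-class saliency queries score frames, and the
# most salient frames per class are kept for recognition.
import torch
import torch.nn as nn

class SaliencyQuery(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable query vector per class.
        self.queries = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, frame_feats: torch.Tensor, top_k: int = 8) -> torch.Tensor:
        # frame_feats: (T, D) features of T frames.
        saliency = self.queries @ frame_feats.t()   # (num_classes, T) scores
        return saliency.topk(top_k, dim=1).indices  # salient frame indices per class
```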
- CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification [11.894289991529496]
Few-shot classification is a challenging problem that aims to learn a model that can adapt to unseen classes given only a few labeled samples.
Recent approaches pre-train a feature extractor and then fine-tune it through episodic meta-learning.
We propose a strategy to cross-attend and re-weight discriminative features for few-shot classification (a rough sketch follows this entry).
arXiv Detail & Related papers (2022-03-25T06:14:51Z)
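The cross-attend-and-re-weight idea in the CAD entry above could look roughly like the sketch below, where query features are re-expressed as attention-weighted combinations of support features; the scaled dot-product scoring and all names are assumptions, not the paper's method.

```python
# Rough sketch: cross-attention from query features to support features,
# producing re-weighted query representations for few-shot classification.
import torch
import torch.nn.functional as F

def cross_reweight(query_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
    # query_feats: (Q, D), support_feats: (S, D).
    d = query_feats.shape[1]
    attn = F.softmax(query_feats @ support_feats.t() / d ** 0.5, dim=1)  # (Q, S)
    # Each query feature becomes a support-conditioned mixture.
    return attn @ support_feats  # (Q, D)
```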
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the fixed-size temporal kernels currently used in 3D convolutional neural networks (CNNs) can be improved to better handle temporal variations in the input.
We study how to better discriminate between classes of actions by enhancing their feature differences across different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Memory Group Sampling Based Online Action Recognition Using Kinetic Skeleton Features [4.674689979981502]
We propose two core ideas to handle the online action recognition problem.
First, we combine the spatial and temporal skeleton features to depict the actions.
Second, we propose a memory group sampling method to combine the previous action frames and current action frames.
Finally, an improved 1D CNN is employed for training and testing on the features from the sampled frames (a rough sketch of this pipeline follows this entry).
arXiv Detail & Related papers (2020-11-01T16:43:08Z)
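A rough reading of the memory-group-sampling pipeline above: keep a memory of past frame features, draw a fixed-size group that mixes memory and current frames, and classify the group with a 1D CNN. The uniform half-and-half split, the feature sizes, and the class count below are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of memory group sampling + a 1D CNN classifier.
import torch
import torch.nn as nn

def sample_group(memory: torch.Tensor, current: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    # memory: (M, D) past frame features; current: (C, D) current-window features.
    half = group_size // 2
    mem_idx = torch.linspace(0, len(memory) - 1, half).long()   # uniform picks from memory
    cur_idx = torch.linspace(0, len(current) - 1, half).long()  # uniform picks from current
    return torch.cat([memory[mem_idx], current[cur_idx]], dim=0)  # (group_size, D)

# 1D CNN over the temporal axis of the sampled group (60 classes assumed).
classifier = nn.Sequential(
    nn.Conv1d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 60),
)

group = sample_group(torch.randn(200, 256), torch.randn(40, 256))
logits = classifier(group.t().unsqueeze(0))  # (1, 60) action scores
```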
This list is automatically generated from the titles and abstracts of the papers on this site.