Actions as Moving Points
- URL: http://arxiv.org/abs/2001.04608v3
- Date: Sat, 22 Aug 2020 14:45:35 GMT
- Title: Actions as Moving Points
- Authors: Yixuan Li, Zixu Wang, Limin Wang, Gangshan Wu
- Abstract summary: We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms existing state-of-the-art methods on both the frame-mAP and video-mAP metrics on the JHMDB and UCF101-24 datasets.
- Score: 66.21507857877756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The existing action tubelet detectors often depend on heuristic anchor design
and placement, which might be computationally expensive and sub-optimal for
precise localization. In this paper, we present a conceptually simple,
computationally efficient, and more precise action tubelet detection framework,
termed MovingCenter Detector (MOC-detector), by treating an action instance
as a trajectory of moving points. Based on the insight that movement
information could simplify and assist action tubelet detection, our
MOC-detector is composed of three crucial head branches: (1) Center Branch for
instance center detection and action recognition, (2) Movement Branch for
movement estimation across adjacent frames to form trajectories of moving points,
and (3) Box Branch for spatial extent detection by directly regressing bounding box
size at each estimated center. These three branches work together to generate
the tubelet detection results, which could be further linked to yield
video-level tubes with a matching strategy. Our MOC-detector outperforms
existing state-of-the-art methods on both the frame-mAP and video-mAP metrics
on the JHMDB and UCF101-24 datasets. The performance gap widens at higher
video IoU thresholds, demonstrating that our MOC-detector is particularly effective
for more precise action detection. We provide the code at
https://github.com/MCG-NJU/MOC-Detector.
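As a rough illustration of how the three branches fit together, here is a minimal PyTorch-style sketch of a MOC-style head. This is not the authors' released code (see the repository above); the channel sizes, layer choices, and tubelet length are assumptions.

```python
import torch
import torch.nn as nn

class MOCHead(nn.Module):
    """Hypothetical three-branch head over per-frame backbone features."""
    def __init__(self, in_channels: int, num_classes: int, tubelet_len: int):
        super().__init__()
        K = tubelet_len
        def branch(out_ch: int) -> nn.Sequential:  # small conv head (assumed design)
            return nn.Sequential(
                nn.Conv2d(in_channels * K, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_ch, 1),
            )
        # Center Branch: per-class center heatmap on the key frame.
        self.center = branch(num_classes)
        # Movement Branch: (dx, dy) from the key-frame center to every frame.
        self.movement = branch(2 * K)
        # Box Branch: width/height regressed frame-by-frame at each center.
        self.box = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 1),
        )

    def forward(self, feats: torch.Tensor):
        # feats: (K, C, H, W) backbone features for one tubelet.
        K, C, H, W = feats.shape
        stacked = feats.reshape(1, K * C, H, W)        # stack frames along channels
        heatmap = torch.sigmoid(self.center(stacked))  # (1, num_classes, H, W)
        movement = self.movement(stacked)              # (1, 2K, H, W)
        sizes = self.box(feats)                        # (K, 2, H, W), per frame
        return heatmap, movement, sizes

head = MOCHead(in_channels=64, num_classes=24, tubelet_len=7)
hm, mv, wh = head(torch.randn(7, 64, 32, 32))
```

In this sketch, peaks in the key-frame heatmap give candidate centers, the movement map shifts each center to the other frames, and the per-frame width/height map sizes a box at each shifted center, yielding a tubelet.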
Related papers
- STMixer: A One-Stage Sparse Action Detector [43.62159663367588]
We propose two core designs for a more flexible one-stage action detector.
First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of features from the entire spatio-temporal domain.
Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes features along the spatial and temporal dimensions respectively for better feature decoding.
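A rough sketch of the sampling-then-mixing pattern follows; this is an assumption of the general idea, not STMixer's actual modules, and the decoupled mixing is simplified here to a point-axis and a channel-axis linear mix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSampleMix(nn.Module):
    """Hypothetical module: queries sample points from a video feature
    volume, then the sampled features are mixed along separate axes."""
    def __init__(self, dim: int = 256, num_points: int = 16):
        super().__init__()
        self.num_points = num_points
        # Each query predicts normalized (x, y, t) sampling coordinates.
        self.to_coords = nn.Linear(dim, num_points * 3)
        self.point_mix = nn.Linear(num_points, num_points)  # mix across sampled points
        self.channel_mix = nn.Linear(dim, dim)              # mix across channels

    def forward(self, queries: torch.Tensor, video_feats: torch.Tensor):
        # queries: (N, dim); video_feats: (dim, T, H, W) for one clip.
        N = queries.shape[0]
        grid = torch.tanh(self.to_coords(queries)).view(1, N, self.num_points, 1, 3)
        sampled = F.grid_sample(video_feats.unsqueeze(0), grid, align_corners=False)
        sampled = sampled.squeeze(-1).squeeze(0).permute(1, 0, 2)  # (N, dim, P)
        mixed = self.point_mix(sampled)                   # mix along the point axis
        mixed = self.channel_mix(mixed.transpose(1, 2))   # mix along the channel axis
        return mixed.mean(dim=1)                          # (N, dim) updated queries

m = SparseSampleMix()
out = m(torch.randn(4, 256), torch.randn(256, 8, 14, 14))  # (4, 256)
```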
arXiv Detail & Related papers (2024-04-15T14:52:02Z)
- Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving.
We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector.
We conduct extensive experiments on the KITTI, Waymo, and nuScenes datasets, and the results demonstrate the state-of-the-art performance of Shift-SSD in both detection accuracy and runtime efficiency.
arXiv Detail & Related papers (2024-03-10T10:36:32Z)
- SeMoLi: What Moves Together Belongs Together [51.72754014130369]
We tackle semi-supervised object detection based on motion cues.
Recent results suggest that motion-based clustering methods can be used to pseudo-label instances of moving objects.
We re-think this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner.
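A toy sketch of that pseudo-labeling recipe, under illustrative assumptions (not SeMoLi's pipeline): cluster points whose positions and estimated motions agree, and convert each moving cluster into an axis-aligned pseudo-label box.

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumed dependency for clustering

def motion_pseudo_labels(points, flow, min_speed=0.5, eps=1.0, min_samples=10):
    """points: (N, 3) lidar points; flow: (N, 3) estimated per-point motion."""
    moving = np.linalg.norm(flow, axis=1) > min_speed  # keep moving points only
    feats = np.hstack([points[moving], flow[moving]])  # cluster position + motion
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    boxes = []
    for cid in np.unique(labels):
        if cid == -1:              # skip DBSCAN noise
            continue
        pts = points[moving][labels == cid]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        boxes.append(np.concatenate([(lo + hi) / 2, hi - lo]))  # center, size
    return np.asarray(boxes)       # (M, 6) pseudo-label boxes
```

These boxes would then supervise a standard 3D detector in place of human labels.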
arXiv Detail & Related papers (2024-02-29T18:54:53Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-term space-time dependencies in the later stages.
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
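The stage-wise attention pattern can be sketched in a few lines; this is a simplified stand-in under assumed shapes, not STPT's actual blocks (which also build a pyramid and reduce tokens between stages).

```python
import torch
import torch.nn as nn

class LocalThenGlobal(nn.Module):
    """Early stage: attention within local windows; late stage: global attention."""
    def __init__(self, dim: int = 96, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, L, dim) flattened space-time tokens, L divisible by window.
        B, L, D = tokens.shape
        w = tokens.reshape(B * (L // self.window), self.window, D)
        w, _ = self.local_attn(w, w, w)      # cheap: attends within each window
        x = w.reshape(B, L, D)
        x, _ = self.global_attn(x, x, x)     # expensive: attends across all tokens
        return x

y = LocalThenGlobal()(torch.randn(2, 64, 96))  # (2, 64, 96)
```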
arXiv Detail & Related papers (2022-07-21T12:38:05Z)
- Track, Check, Repeat: An EM Approach to Unsupervised Tracking [20.19397660306534]
We propose an unsupervised method for detecting and tracking moving objects in 3D, in unlabelled RGB-D videos.
We learn an ensemble of appearance-based 2D and 3D detectors, under heavy data augmentation.
We compare against existing unsupervised object discovery and tracking methods, using challenging videos from CATER and KITTI.
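The EM-style loop reads roughly as follows; the structure is inferred from the summary above, and the detector, tracker, and consistency check are placeholder callables.

```python
def em_track_check_repeat(videos, detect, track, check, train, rounds=3):
    """Alternate pseudo-labeling (E-step) and retraining (M-step).
    detect/track/check/train are user-supplied callables (assumptions)."""
    pseudo_labels = []
    for _ in range(rounds):
        # E-step: run the current detectors, link detections into tracks,
        # and keep only tracks that pass consistency checks (e.g. 2D-3D
        # agreement) as pseudo-labels.
        pseudo_labels = [t for v in videos for t in track(detect(v)) if check(t)]
        # M-step: retrain the detector ensemble on the surviving
        # pseudo-labels, under heavy data augmentation.
        detect = train(pseudo_labels)
    return detect, pseudo_labels
```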
arXiv Detail & Related papers (2021-04-07T22:51:39Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
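A minimal sketch of such a channel-wise gate follows; it is a simplified reading of the CME idea, with an adjacent-frame feature difference as the motion proxy and assumed layer sizes.

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Reweight channels by how strongly they respond to frame-to-frame change."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, cur: torch.Tensor, nxt: torch.Tensor):
        # cur, nxt: (B, C, H, W) feature maps of adjacent frames.
        diff = (nxt - cur).mean(dim=(2, 3))   # per-channel motion proxy: (B, C)
        gate = self.fc(diff)                  # channel-wise gate vector in (0, 1)
        return cur * gate[:, :, None, None]   # emphasize motion-related channels

g = ChannelMotionGate(64)
out = g(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16))  # (2, 64, 16, 16)
```

The SME module would analogously reweight spatial positions using point-to-point similarity between adjacent feature maps.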
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization [42.95186231216036]
We propose Coarse-to-Fine Action Detector (CFAD) for efficient action localization.
CFAD first estimates coarse spatio-temporal action tubes from video streams, and then refines the tube locations based on key timestamps.
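As a toy illustration of the coarse-to-fine step, one can correct a coarse tube at a few key timestamps and interpolate in between; the interpolation scheme here is an assumption, whereas CFAD learns both the key timestamps and the corrections.

```python
import numpy as np

def refine_tube(coarse_boxes, key_ts, key_boxes):
    """coarse_boxes: (T, 4) per-frame boxes; key_ts: sorted frame indices
    with refined boxes key_boxes of shape (K, 4)."""
    T = len(coarse_boxes)  # the coarse tube supplies the temporal extent
    refined = np.empty((T, 4))
    for c in range(4):  # interpolate each coordinate between key frames
        refined[:, c] = np.interp(np.arange(T), key_ts, key_boxes[:, c])
    return refined

coarse = np.tile([10.0, 10.0, 50.0, 50.0], (8, 1))
refined = refine_tube(coarse, key_ts=[0, 7],
                      key_boxes=np.array([[12, 9, 52, 48], [20, 15, 60, 55]]))
```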
arXiv Detail & Related papers (2020-08-19T08:47:50Z)
- Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
With the effectiveness of its three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.