LocATe: End-to-end Localization of Actions in 3D with Transformers
- URL: http://arxiv.org/abs/2203.10719v1
- Date: Mon, 21 Mar 2022 03:35:32 GMT
- Title: LocATe: End-to-end Localization of Actions in 3D with Transformers
- Authors: Jiankai Sun, Bolei Zhou, Michael J. Black, Arjun Chandrasekaran
- Abstract summary: LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike existing autoregressive models that focus on modeling local context, LocATe's transformer captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
- Score: 91.28982770522329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding a person's behavior from their 3D motion is a fundamental
problem in computer vision with many applications. An important component of
this problem is 3D Temporal Action Localization (3D-TAL), which involves
recognizing what actions a person is performing, and when. State-of-the-art
3D-TAL methods employ a two-stage approach in which the action span detection
task and the action recognition task are implemented as a cascade. This
approach, however, limits the possibility of error-correction. In contrast, we
propose LocATe, an end-to-end approach that jointly localizes and recognizes
actions in a 3D sequence. Further, unlike existing autoregressive models that
focus on modeling the local context in a sequence, LocATe's transformer model
is capable of capturing long-term correlations between actions in a sequence.
Unlike transformer-based object-detection and classification models which
consider image or patch features as input, the input in 3D-TAL is a long
sequence of highly correlated frames. To handle the high-dimensional input, we
implement an effective input representation, and overcome the diffuse attention
across long time horizons by introducing sparse attention in the model. LocATe
outperforms previous approaches on the existing PKU-MMD 3D-TAL benchmark
(mAP=93.2%). Finally, we argue that benchmark datasets are most useful where
there is clear room for performance improvement. To that end, we introduce a
new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20),
where the performance of state-of-the-art methods is significantly worse. The
dataset and code for the method will be available for research purposes.
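To make the described architecture concrete, below is a minimal sketch of an end-to-end 3D-TAL model in the spirit of the abstract: per-frame pose features, a transformer encoder with a banded (sparse) attention mask to keep attention from diffusing over long horizons, and DETR-style action queries decoded into span and class predictions. All names, dimensions, and the windowing scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Sparse3DTALSketch(nn.Module):
    """Illustrative end-to-end 3D temporal action localization model:
    per-frame pose embedding -> sparse-attention encoder -> DETR-style
    action queries decoded into span + class predictions. All names and
    sizes are assumptions, not the LocATe reference implementation."""

    def __init__(self, pose_dim=75, d_model=256, n_classes=20,
                 n_queries=30, window=32, n_layers=4, n_heads=8):
        super().__init__()
        self.window = window
        self.embed = nn.Linear(pose_dim, d_model)  # per-frame input representation
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.queries = nn.Embedding(n_queries, d_model)    # learned action queries
        self.span_head = nn.Linear(d_model, 2)             # normalized (center, width)
        self.cls_head = nn.Linear(d_model, n_classes + 1)  # +1 "no action" slot

    def sparse_mask(self, t, device):
        # Banded mask: each frame attends only to a local temporal window,
        # preventing attention from diffusing over very long sequences.
        idx = torch.arange(t, device=device)
        return (idx[None, :] - idx[:, None]).abs() > self.window

    def forward(self, poses):  # poses: (B, T, pose_dim); positional encoding omitted
        b, t, _ = poses.shape
        x = self.encoder(self.embed(poses), mask=self.sparse_mask(t, poses.device))
        h = self.decoder(self.queries.weight.unsqueeze(0).expand(b, -1, -1), x)
        return self.span_head(h).sigmoid(), self.cls_head(h)

spans, logits = Sparse3DTALSketch()(torch.randn(2, 500, 75))
print(spans.shape, logits.shape)  # torch.Size([2, 30, 2]) torch.Size([2, 30, 21])
```

As in DETR, training would pair predictions with ground-truth segments via bipartite matching; that loss is omitted here for brevity.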
Related papers
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
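As a rough illustration of the memory-bank idea (not the paper's code), the current frame's feature can cross-attend to a fixed-length bank of historical features, which is then updated first-in-first-out; all names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):
    """Toy per-tracklet memory interaction: only the current frame's
    feature is computed online, and it cross-attends to a fixed-length
    bank of historical features (names and sizes are illustrative)."""

    def __init__(self, d_model=128, bank_size=8, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.register_buffer("bank", torch.zeros(1, bank_size, d_model))

    def forward(self, cur_feat):  # cur_feat: (1, 1, d_model)
        fused, _ = self.attn(cur_feat, self.bank, self.bank)  # query the history
        # FIFO update: drop the oldest slot, append the fused current feature.
        self.bank = torch.cat([self.bank[:, 1:], fused.detach()], dim=1)
        return fused

tracker = MemoryBankTracker()
for _ in range(5):                      # frames arrive one at a time
    out = tracker(torch.randn(1, 1, 128))
print(out.shape)                        # torch.Size([1, 1, 128])
```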
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of whole frames in the video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- SRT3D: A Sparse Region-Based 3D Object Tracking Approach for the Real World [10.029003607782878]
Region-based methods have become increasingly popular for model-based, monocular 3D tracking of texture-less objects in cluttered scenes.
However, most methods are computationally expensive, requiring significant resources to run in real-time.
We develop SRT3D, a sparse region-based approach to 3D object tracking that bridges this gap in efficiency.
arXiv Detail & Related papers (2021-10-25T07:58:18Z)
- Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
- Real-time Human Action Recognition Using Locally Aggregated Kinematic-Guided Skeletonlet and Supervised Hashing-by-Analysis Model [30.435850177921086]
3D action recognition suffers from three problems: highly complicated articulation, a great amount of noise, and low implementation efficiency.
We propose a real-time 3D action recognition framework by integrating the locally aggregated kinematic-guided skeletonlet (LAKS) with a supervised hashing-by-analysis (SHA) model.
Experimental results on MSRAction3D, UTKinectAction3D and Florence3DAction datasets demonstrate that the proposed method outperforms state-of-the-art methods in both recognition accuracy and implementation efficiency.
arXiv Detail & Related papers (2021-05-24T14:46:40Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a fully trainable Neural Message Passing network for data association.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
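A single round of neural message passing for data association might look like the following sketch, where edges between track and detection embeddings are updated from their endpoint nodes and then scored as affinities; the sizes and single-round design are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AssociationMPN(nn.Module):
    """One neural message-passing round for data association: nodes are
    track/detection embeddings, edges are candidate matches whose states
    are updated from their endpoints and scored as affinities."""

    def __init__(self, d=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.score = nn.Linear(d, 1)

    def forward(self, tracks, dets, edges):
        # tracks: (Nt, d), dets: (Nd, d), edges: (Nt, Nd, d)
        nt, nd, d = edges.shape
        msg = torch.cat([tracks[:, None].expand(nt, nd, d),
                         dets[None, :].expand(nt, nd, d), edges], dim=-1)
        return self.score(self.edge_mlp(msg)).squeeze(-1)  # (Nt, Nd) affinities

aff = AssociationMPN()(torch.randn(5, 64), torch.randn(7, 64), torch.randn(5, 7, 64))
print(aff.shape)  # torch.Size([5, 7])
```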
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
- Efficient Spatialtemporal Context Modeling for Action Recognition [42.30158166919919]
We propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range contextual information in video for action recognition.
We model the relationship between points on the same line along the horizontal, vertical, and depth directions at each timestep, forming a 3D criss-cross structure.
Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 11% for the video context modeling.
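A simplified rendering of the criss-cross idea in 3D is sketched below: each position attends only along its temporal, vertical, and horizontal lines rather than over all positions, which is what cuts parameters and FLOPs relative to full non-local attention. This is an illustrative approximation, not the published RCCA-3D module:

```python
import torch
import torch.nn as nn

class CrissCross3D(nn.Module):
    """Simplified 3D criss-cross attention: each position aggregates
    context only from positions sharing its temporal, vertical, or
    horizontal line, instead of from all T*H*W positions as in full
    non-local attention. An illustrative approximation of the idea."""

    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv3d(c, c // 8, 1)  # reduced query/key channels
        self.k = nn.Conv3d(c, c // 8, 1)
        self.v = nn.Conv3d(c, c, 1)

    @staticmethod
    def _axis_attn(q, k, v, dim):
        # Move the attended axis last, flatten the other two spatial axes,
        # and run attention independently along each 1D line.
        q, k, v = (t.movedim(dim, -1) for t in (q, k, v))
        shape = v.shape                                     # (B, C, d1, d2, L)
        qf, kf, vf = (t.flatten(2, 3) for t in (q, k, v))   # (B, C, N, L)
        attn = torch.softmax(torch.einsum('bcnl,bcnm->bnlm', qf, kf), dim=-1)
        out = torch.einsum('bnlm,bcnm->bcnl', attn, vf)
        return out.reshape(shape).movedim(-1, dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        q, k, v = self.q(x), self.k(x), self.v(x)
        out = sum(self._axis_attn(q, k, v, d) for d in (2, 3, 4))
        return x + out     # residual connection

y = CrissCross3D(16)(torch.randn(2, 16, 4, 6, 6))
print(y.shape)  # torch.Size([2, 16, 4, 6, 6])
```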
arXiv Detail & Related papers (2021-03-20T14:48:12Z)
- A two-stage data association approach for 3D Multi-object Tracking [0.0]
We adapt a two-stage data association method, which was successful in image-based tracking, to the 3D setting.
Our method outperforms a baseline that uses one-stage bipartite matching for data association, achieving 0.587 AMOTA on the nuScenes validation set.
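For context, the one-stage baseline reduces to a single Hungarian assignment over a cost matrix; a two-stage variant can first match high-confidence detections and then offer the remainder to still-unmatched tracks. The following is a hypothetical, simplified rendering (the thresholds and staging rule are assumptions, not the paper's method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_stage_associate(cost, det_conf, conf_thresh=0.5, max_cost=0.8):
    """Toy two-stage association over a (tracks x detections) cost matrix:
    stage 1 matches high-confidence detections, stage 2 offers the rest to
    still-unmatched tracks. Thresholds and staging rule are illustrative."""
    matches, used_t, used_d = [], set(), set()
    stages = [np.where(det_conf >= conf_thresh)[0],   # stage 1: confident
              np.where(det_conf < conf_thresh)[0]]    # stage 2: leftovers
    for det_ids in stages:
        free_t = np.array([t for t in range(cost.shape[0]) if t not in used_t])
        free_d = np.array([d for d in det_ids if d not in used_d])
        if len(free_t) == 0 or len(free_d) == 0:
            continue
        rows, cols = linear_sum_assignment(cost[np.ix_(free_t, free_d)])
        for r, c in zip(rows, cols):
            if cost[free_t[r], free_d[c]] <= max_cost:  # gate weak matches
                matches.append((int(free_t[r]), int(free_d[c])))
                used_t.add(free_t[r]); used_d.add(free_d[c])
    return matches  # list of (track_idx, detection_idx)

rng = np.random.default_rng(0)
print(two_stage_associate(rng.random((4, 5)), np.array([0.9, 0.4, 0.8, 0.3, 0.7])))
```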
arXiv Detail & Related papers (2021-01-21T15:50:17Z)
- Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation [8.854112907350624]
3D multi-object tracking plays a vital role in autonomous navigation.
Many approaches detect objects in 2D RGB sequences for tracking, which is lack of reliability when localizing objects in 3D space.
We propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in the adjacent frames.
arXiv Detail & Related papers (2020-11-25T16:14:40Z)