MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
- URL: http://arxiv.org/abs/2112.03902v1
- Date: Tue, 7 Dec 2021 18:57:37 GMT
- Title: MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
- Authors: Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond
- Abstract summary: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
- Score: 37.25262046781015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relations in such datasets are complex, including challenges like composite actions and co-occurring actions. For detecting actions in these complex videos, efficiently capturing both short-term and long-term temporal information is critical. To this end, we propose a novel ConvTransformer network for action detection. This network comprises three main components: (1) a Temporal Encoder module that extensively explores global and local temporal relations at multiple temporal resolutions; (2) a Temporal Scale Mixer module that effectively fuses the multi-scale features into a unified feature representation; and (3) a Classification module that learns the instance center-relative position and predicts frame-level classification scores. Extensive experiments on multiple datasets, including Charades, TSU and MultiTHUMOS, confirm the effectiveness of the proposed method. Our network outperforms the state-of-the-art methods on all three datasets.
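To make the three components concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes. It is an illustrative reading rather than the authors' implementation: the `EncoderStage` layout, all layer sizes, and the max-pooling between temporal resolutions are assumptions; only the overall structure (a multi-resolution encoder, a scale mixer, and a frame-level classifier) follows the abstract.

```python
# Minimal sketch of the MS-TCT pipeline described in the abstract (assumed
# details throughout; illustrative sizes, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderStage(nn.Module):
    """One temporal resolution: local temporal conv + global self-attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1)       # short-term
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # long-term
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, C)
        x = x + self.local(x.transpose(1, 2)).transpose(1, 2)
        a, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        return x + a

class MSTCTSketch(nn.Module):
    def __init__(self, in_dim=1024, dim=256, num_classes=157, stages=3):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.stages = nn.ModuleList(EncoderStage(dim) for _ in range(stages))
        self.pool = nn.MaxPool1d(2)               # halve T between resolutions
        self.mix = nn.Linear(dim * stages, dim)   # Temporal Scale Mixer
        self.cls = nn.Linear(dim, num_classes)    # frame-level scores

    def forward(self, feats):                  # feats: (B, T, in_dim) clip features
        x, T = self.proj(feats), feats.shape[1]
        outs = []
        for stage in self.stages:
            x = stage(x)
            # upsample every scale back to length T so the scales can be fused
            outs.append(F.interpolate(x.transpose(1, 2), size=T).transpose(1, 2))
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        fused = self.mix(torch.cat(outs, dim=-1))
        return self.cls(fused)                 # (B, T, num_classes) logits

scores = MSTCTSketch()(torch.randn(2, 64, 1024))
print(scores.shape)  # torch.Size([2, 64, 157])
```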
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection and interpolation.
We conducted extensive experiments on the nuScenes and Argoverse2 datasets to evaluate our approach.
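The summary suggests a simple mechanism: project 3D points of interest into the image and sample features there. The sketch below is a hypothetical rendering of that idea, not PoIFusion's code; `PoIFuser`, `sample_image_feats`, and every size are made-up illustrations, and `P` is assumed to be a standard 3x4 camera projection matrix.

```python
# Hypothetical fusion at points of interest: project 3D PoIs into the image,
# bilinearly sample image features there, and fuse with per-point LiDAR
# features via an MLP (assumed design, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_image_feats(img_feats, pois_3d, P, H, W):
    """img_feats: (B, C, H, W); pois_3d: (B, N, 3); P: (B, 3, 4) projection."""
    ones = torch.ones_like(pois_3d[..., :1])
    uvw = torch.einsum('bij,bnj->bni', P, torch.cat([pois_3d, ones], -1))
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)            # pixel coords
    # normalise to [-1, 1] for grid_sample (x = width axis first)
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], -1) * 2 - 1
    out = F.grid_sample(img_feats, grid.unsqueeze(2), align_corners=True)
    return out.squeeze(-1).transpose(1, 2)                      # (B, N, C)

class PoIFuser(nn.Module):
    def __init__(self, c_img=64, c_pt=64, c_out=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_img + c_pt, c_out), nn.ReLU(),
                                 nn.Linear(c_out, c_out))

    def forward(self, img_feats, pt_feats, pois_3d, P):
        H, W = img_feats.shape[-2:]
        img_at_poi = sample_image_feats(img_feats, pois_3d, P, H, W)
        return self.mlp(torch.cat([img_at_poi, pt_feats], -1))  # (B, N, c_out)

fused = PoIFuser()(torch.randn(2, 64, 40, 60),   # image feature map
                   torch.randn(2, 100, 64),      # per-point LiDAR features
                   torch.rand(2, 100, 3) * 10,   # 100 hypothetical PoIs
                   torch.rand(2, 3, 4))          # camera projection matrices
```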
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity Recognition [2.7677069267434873]
Drone-camera-based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
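As a rough illustration of sparse weighted temporal fusion as the summary states it, the sketch below samples a fixed number of frames from a long feature sequence and fuses them with learned softmax weights. `SWTFSketch`, the uniform sampling, and all dimensions are assumptions, not the paper's design.

```python
# Hypothetical sparse weighted temporal fusion: sparsely sample frames, score
# each with a small network, and fuse as a softmax-weighted sum (assumed).
import torch
import torch.nn as nn

class SWTFSketch(nn.Module):
    def __init__(self, dim=512, num_classes=13, num_samples=8):
        super().__init__()
        self.num_samples = num_samples
        self.scorer = nn.Linear(dim, 1)         # learns per-frame importance
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):             # (B, T, dim), T >> num_samples
        T = frame_feats.shape[1]
        idx = torch.linspace(0, T - 1, self.num_samples).long()  # sparse sampling
        x = frame_feats[:, idx]                                  # (B, S, dim)
        w = torch.softmax(self.scorer(x), dim=1)                 # (B, S, 1)
        fused = (w * x).sum(dim=1)                               # weighted fusion
        return self.cls(fused)

logits = SWTFSketch()(torch.randn(2, 120, 512))  # (B, num_classes)
```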
arXiv Detail & Related papers (2022-11-10T12:45:43Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that effectively extracts action visual tempo from low-level backbone features at a single layer.
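A hypothetical reading of that idea in code: build a fast and a strided (slow) view of the same single-layer feature sequence and correlate them, so tempo information is extracted without multi-rate video sampling. `TempoCorrelation` and all its details are illustrative assumptions, not the published TCM.

```python
# Hypothetical tempo extraction from one backbone layer: correlate a
# full-rate view with a temporally strided view of the same features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TempoCorrelation(nn.Module):
    def __init__(self, dim=256, stride=2):
        super().__init__()
        self.stride = stride
        self.proj = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                        # x: (B, C, T) single-layer feats
        fast = self.proj(x)                      # full frame rate
        slow = self.proj(x[..., ::self.stride])  # strided view, coarser tempo
        slow = F.interpolate(slow, size=x.shape[-1])       # realign lengths
        corr = torch.softmax((fast * slow).sum(1, keepdim=True) /
                             fast.shape[1] ** 0.5, dim=-1)  # per-step correlation
        return x + corr * fast                   # tempo-weighted residual

out = TempoCorrelation()(torch.randn(2, 256, 32))  # (B, C, T), same shape as input
```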
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- CTRN: Class-Temporal Relational Network for Action Detection [7.616556723260849]
We introduce an end-to-end network: the Class-Temporal Relational Network (CTRN).
CTRN contains three key components: the Representation Transform Module, the Class-Temporal Module and the G-classifier.
We evaluate CTRN on three densely labelled datasets and achieve state-of-the-art performance.
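A rough sketch of what class-temporal relational modelling could look like, under the assumption (not confirmed by the summary) that frames are first mapped to one token per class and attention then runs along time and across classes; `CTRNSketch` and all sizes are hypothetical.

```python
# Hypothetical class-temporal modelling: per-class tokens, attention along
# time, then across classes, then per-class frame-level scores (assumed).
import torch
import torch.nn as nn

class CTRNSketch(nn.Module):
    def __init__(self, in_dim=1024, dim=64, num_classes=157, heads=4):
        super().__init__()
        self.to_class = nn.Linear(in_dim, num_classes * dim)  # representation transform
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.class_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.g_cls = nn.Linear(dim, 1)                        # per-class score

    def forward(self, x):                        # x: (B, T, in_dim)
        B, T, _ = x.shape
        h = self.to_class(x).view(B, T, -1, self.g_cls.in_features)  # (B, T, K, D)
        K, D = h.shape[2], h.shape[3]
        t = h.permute(0, 2, 1, 3).reshape(B * K, T, D)   # attend along time
        t = t + self.time_attn(t, t, t)[0]
        c = t.view(B, K, T, D).permute(0, 2, 1, 3).reshape(B * T, K, D)
        c = c + self.class_attn(c, c, c)[0]              # attend across classes
        return self.g_cls(c).view(B, T, K)               # (B, T, num_classes)

scores = CTRNSketch()(torch.randn(2, 32, 1024))  # frame-level co-occurring actions
```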
arXiv Detail & Related papers (2021-10-26T08:15:47Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
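Since the module is described as plug-and-play over a feature pyramid pair, a minimal sketch is easy to state: for each pyramid level, a learned gate decides how much of the previous frame's features to aggregate into the current frame. The gating design and the omission of explicit frame-to-frame alignment are simplifying assumptions, not the paper's actual routing.

```python
# Hypothetical pyramid routing between two adjacent frames: a per-level gate
# mixes the previous frame's features into the current frame's (assumed).
import torch
import torch.nn as nn

class PyramidRouter(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels)

    def forward(self, pyr_prev, pyr_cur):       # lists of (B, C_l, H_l, W_l)
        fused = []
        for gate, prev, cur in zip(self.gates, pyr_prev, pyr_cur):
            g = torch.sigmoid(gate(torch.cat([prev, cur], dim=1)))  # routing weights
            fused.append(cur + g * prev)        # pixel-level aggregation
        return fused

pyr = lambda: [torch.randn(2, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
out = PyramidRouter()(pyr(), pyr())             # fused pyramid, one tensor per level
```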
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- Multi-Level Temporal Pyramid Network for Action Detection [47.223376232616424]
We propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features.
By this means, the proposed MLTPN can learn rich and discriminative features for action instances of different durations.
We evaluate MLTPN on two challenging datasets, THUMOS'14 and ActivityNet v1.3; the experimental results show that MLTPN obtains competitive performance on ActivityNet v1.3 and significantly outperforms the state-of-the-art approaches on THUMOS'14.
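A minimal sketch of a multi-level temporal pyramid over clip features, as one plausible reading of the summary (not the MLTPN implementation): each level temporally downsamples the sequence with a strided convolution, and all levels are upsampled back and fused so that actions of different durations are described at matching granularity.

```python
# Hypothetical multi-level temporal pyramid: strided temporal convolutions at
# several rates, upsampled to a common length and fused (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramid(nn.Module):
    def __init__(self, dim=256, levels=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, 3, stride=2 ** l, padding=1) for l in range(levels))
        self.fuse = nn.Conv1d(dim * levels, dim, 1)

    def forward(self, x):                       # x: (B, C, T)
        T = x.shape[-1]
        feats = [F.interpolate(conv(x), size=T) for conv in self.convs]
        return self.fuse(torch.cat(feats, dim=1))  # (B, C, T)

out = TemporalPyramid()(torch.randn(2, 256, 64))
```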
arXiv Detail & Related papers (2020-08-07T17:08:24Z)
- Segment as Points for Efficient Online Multi-Object Tracking and Segmentation [66.03023110058464]
We propose a highly effective method for learning instance embeddings from segments by converting the compact image representation to an unordered 2D point cloud representation.
Our method generates a new tracking-by-points paradigm where discriminative instance embeddings are learned from randomly selected points rather than images.
The resulting online MOTS framework, named PointTrack, surpasses all the state-of-the-art methods by large margins.
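The tracking-by-points idea can be illustrated compactly, assuming a PointNet-style encoder (a per-point MLP followed by max pooling) over randomly sampled pixels of a segment; this is a hypothetical sketch, not PointTrack's code, and the 5-dimensional point features (position plus colour) are an assumption.

```python
# Hypothetical instance embedding from an unordered 2D point cloud: sample a
# random subset of segment pixels and pool an order-invariant embedding.
import torch
import torch.nn as nn

class PointEmbed(nn.Module):
    def __init__(self, in_dim=5, dim=128, num_points=256):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, points):                  # (B, N, 5): x, y, r, g, b per pixel
        idx = torch.randint(points.shape[1], (self.num_points,))
        h = self.mlp(points[:, idx])            # encode the random subset
        return h.max(dim=1).values              # order-invariant instance embedding

enc = PointEmbed()
emb = enc(torch.rand(1, 5000, 5))  # match embeddings across frames, e.g. cosine similarity
```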
arXiv Detail & Related papers (2020-07-03T08:29:35Z)
- Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
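A minimal, hypothetical rendering of a feature-level temporal pyramid that could be integrated into a 2D or 3D backbone (not the TPN release): features from two backbone depths are brought to a common temporal rate and channel width, then summed before classification. All module names and sizes here are illustrative.

```python
# Hypothetical feature-level temporal pyramid head: align two backbone depths
# in time and width, sum them, and classify the clip (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidHead(nn.Module):
    def __init__(self, c_shallow=256, c_deep=512, dim=256, num_classes=400):
        super().__init__()
        self.p_shallow = nn.Conv1d(c_shallow, dim, 1)
        self.p_deep = nn.Conv1d(c_deep, dim, 1)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, f_shallow, f_deep):       # (B, C1, T) and (B, C2, T//2)
        deep = F.interpolate(self.p_deep(f_deep), size=f_shallow.shape[-1])
        fused = self.p_shallow(f_shallow) + deep      # feature-level aggregation
        return self.cls(fused.mean(dim=-1))           # clip-level prediction

logits = FeaturePyramidHead()(torch.randn(2, 256, 32), torch.randn(2, 512, 16))
```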
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.