TadML: A fast temporal action detection with Mechanics-MLP
- URL: http://arxiv.org/abs/2206.02997v2
- Date: Fri, 2 Feb 2024 17:11:10 GMT
- Title: TadML: A fast temporal action detection with Mechanics-MLP
- Authors: Bowen Deng and Dongchang Liu
- Abstract summary: Temporal Action Detection (TAD) is a crucial but challenging task in video understanding.
Most current models adopt both RGB and Optical-Flow streams for the TAD task.
We propose a one-stage anchor-free temporal localization method with an RGB stream only, in which a novel Newtonian Mechanics-MLP architecture is established.
- Score: 0.5148939336441986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection (TAD) is a crucial but challenging task in video
understanding. It aims to detect both the type and the start and end frames of each
action instance in a long, untrimmed video. Most current models adopt both RGB and
Optical-Flow streams for the TAD task, so the original RGB frames must first be
converted into Optical-Flow frames, which adds computation and time cost and is an
obstacle to real-time processing. In addition, many models adopt two-stage strategies,
which slow down inference and require complicated tuning of proposal generation. By
comparison, we propose a one-stage anchor-free temporal localization method with an
RGB stream only, in which a novel Newtonian Mechanics-MLP architecture is established.
It has accuracy comparable to all existing state-of-the-art models while surpassing
their inference speed by a large margin. The typical inference speed in this paper is
an astounding 4.44 videos per second on THUMOS14. In applications, the inference speed
will be even faster because there is no need to compute optical flow. This also shows
that MLPs have great potential in downstream tasks such as TAD. The source code is
available at
https://github.com/BonedDeng/TadML
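For readers unfamiliar with the one-stage anchor-free formulation mentioned above, here is a minimal PyTorch sketch of what such a head could look like on RGB-only clip features: MLP-Mixer-style token mixing across time, followed by per-timestep class scores and start/end offset regression. All names (TokenMixingMLP, AnchorFreeTADHead) and hyperparameters are illustrative assumptions, not the paper's actual Newtonian Mechanics-MLP design; refer to the repository above for the real implementation.

```python
# Illustrative sketch only (not the authors' implementation): a one-stage,
# anchor-free temporal localization head on RGB clip features. Every temporal
# position predicts class scores plus distances to the action's start and end,
# so no anchors and no proposal-generation stage are needed.
import torch
import torch.nn as nn


class TokenMixingMLP(nn.Module):
    """MLP-Mixer style block: mixes information across time, then across channels."""

    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, time, dim)
        x = x + self.time_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x


class AnchorFreeTADHead(nn.Module):
    """Per-timestep classification and start/end offset regression."""

    def __init__(self, num_tokens: int, dim: int, num_classes: int, depth: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            *[TokenMixingMLP(num_tokens, dim) for _ in range(depth)]
        )
        self.cls_head = nn.Linear(dim, num_classes)  # action class per timestep
        self.reg_head = nn.Linear(dim, 2)            # distances to start and end

    def forward(self, feats):  # feats: (batch, time, dim) RGB clip features
        h = self.backbone(feats)
        cls_logits = self.cls_head(h)           # (B, T, num_classes)
        offsets = torch.relu(self.reg_head(h))  # (B, T, 2), non-negative offsets
        return cls_logits, offsets


if __name__ == "__main__":
    B, T, D, C = 2, 128, 512, 20  # THUMOS14 has 20 action classes
    head = AnchorFreeTADHead(num_tokens=T, dim=D, num_classes=C)
    cls_logits, offsets = head(torch.randn(B, T, D))
    # Typical anchor-free decoding: timestep t spans [t - offsets[..., 0], t + offsets[..., 1]]
    print(cls_logits.shape, offsets.shape)  # (2, 128, 20) and (2, 128, 2)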
Related papers
- MemFlow: Optical Flow Estimation and Prediction with Memory [54.22820729477756]
We present MemFlow, a real-time method for optical flow estimation and prediction with memory.
Our method enables memory read-out and update modules for aggregating historical motion information in real-time.
Our approach seamlessly extends to the future prediction of optical flow based on past observations.
arXiv Detail & Related papers (2024-04-07T04:56:58Z)
- ATCA: an Arc Trajectory Based Model with Curvature Attention for Video Frame Interpolation [10.369068266836154]
We propose an arc trajectory based model (ATCA) which learns motion prior to only two consecutive frames and also is lightweight.
Experiments show that our approach performs better than many SOTA methods with fewer parameters and faster inference speed.
arXiv Detail & Related papers (2022-08-01T13:42:08Z)
- StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity of predicting the future, significantly improving the results for streaming perception.
We consider driving scenes with multiple velocities and propose velocity-aware streaming AP (VsAP) to jointly evaluate accuracy.
Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves sAP and VsAP by 4.7% and 8.2%, respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z)
- RGB Stream Is Enough for Temporal Action Detection [3.2689702143620147]
State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow.
Optical flow is a hand-designed representation that not only requires heavy computation but is also methodologically unsatisfactory, because two-stream methods are often not learned end-to-end jointly with the flow.
We argue that optical flow is dispensable in high-accuracy temporal action detection and image level data augmentation is the key solution to avoid performance degradation when optical flow is removed.
arXiv Detail & Related papers (2021-07-09T11:10:11Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- PAN: Towards Fast Action Recognition via Learning Persistence of Appearance [60.75488333935592]
Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA).
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
arXiv Detail & Related papers (2020-08-08T07:09:54Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two layers in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model that requires less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)