Towards Improving Spatiotemporal Action Recognition in Videos
- URL: http://arxiv.org/abs/2012.08097v1
- Date: Tue, 15 Dec 2020 05:21:50 GMT
- Title: Towards Improving Spatiotemporal Action Recognition in Videos
- Authors: Shentong Mo, Xiaoqing Tan, Jingfei Xia, Pinxu Ren
- Abstract summary: Motivated by the latest state-of-the-art real-time object detector You Only Watch Once (YOWO), we aim to modify its structure to increase action detection precision.
We propose four novel approaches in attempts to improve YOWO and address the imbalanced class issue in videos.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatiotemporal action recognition deals with locating and classifying actions
in videos. Motivated by the latest state-of-the-art real-time object detector
You Only Watch Once (YOWO), we aim to modify its structure to increase action
detection precision and reduce computational time. Specifically, we propose
four novel approaches in attempts to improve YOWO and address the imbalanced
class issue in videos by modifying the loss function. We consider two
moderate-sized datasets to apply our modification of YOWO - the popular
Joint-annotated Human Motion Data Base (J-HMDB-21) and a private dataset of
restaurant video footage provided by a Carnegie Mellon University-based
startup, Agot.AI. The latter involves fast-moving actions with small objects as
well as unbalanced data classes, making the task of action localization more
challenging. We implement our proposed methods in the GitHub repository
https://github.com/stoneMo/YOWOv2.
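The abstract says the imbalanced class issue is addressed by modifying the loss function, but the specific modification is not spelled out here. A common choice for imbalanced classification is a focal loss; the following is a minimal PyTorch sketch of that idea (the function, gamma value, and optional per-class alpha weights are illustrative, not taken from the YOWOv2 repository):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Focal loss for imbalanced multi-class classification.

    logits:  (N, C) raw class scores
    targets: (N,)   integer class labels
    gamma:   focusing parameter; larger values down-weight easy examples
    alpha:   optional (C,) per-class weights, e.g. inverse class frequency
    """
    log_probs = F.log_softmax(logits, dim=-1)                       # (N, C)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # (N,)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                          # down-weight easy samples
    if alpha is not None:
        loss = loss * alpha[targets]                                # re-weight rare classes
    return loss.mean()

# Example: 5 action classes, a batch of 4 clips
logits = torch.randn(4, 5)
targets = torch.tensor([0, 3, 3, 1])
print(focal_loss(logits, targets, alpha=torch.tensor([1.0, 2.0, 2.0, 0.5, 1.0])))
```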
Related papers
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- Towards Active Learning for Action Spotting in Association Football Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z)
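The active learning entry above does not say how the most informative samples are chosen; a standard baseline is uncertainty sampling with predictive entropy. A minimal sketch, assuming a hypothetical clip-level classifier `model` that maps a batch of clips to class logits:

```python
import torch

def select_most_informative(model, unlabeled_clips, k=10):
    """Rank unlabeled clips by predictive entropy and return the indices of the top k.

    model:           clip-level classifier returning (N, C) logits
    unlabeled_clips: batched tensor of unlabeled video clips
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_clips), dim=-1)       # (N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)
    return entropy.topk(k).indices                                  # most uncertain clips first
```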
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
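The fight detection summary above only names the two components: an action-aware feature extractor and an anomaly score generator. Below is a minimal sketch of such a composition, assuming (for illustration only) a small 3D-convolutional extractor over short clips and an MLP head that outputs a fight/anomaly score; the paper's actual backbones and training scheme are not reproduced here:

```python
import torch
import torch.nn as nn

class ActionAwareExtractor(nn.Module):
    """Illustrative clip-level feature extractor (a stand-in for a pretrained action backbone)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                  # pool over time and space
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clip):                          # clip: (N, 3, T, H, W)
        x = self.conv(clip).flatten(1)                # (N, 32)
        return self.proj(x)                           # (N, feat_dim)

class AnomalyScorer(nn.Module):
    """Maps clip features to a scalar anomaly (fight) score in [0, 1]."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feats):
        return self.mlp(feats).squeeze(-1)            # (N,)

# Example: score two 16-frame clips
extractor, scorer = ActionAwareExtractor(), AnomalyScorer()
clips = torch.randn(2, 3, 16, 112, 112)
scores = scorer(extractor(clips))                     # one anomaly score per clip
```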
- Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos [55.60442251060871]
Human actions often induce changes of object states such as "cutting an apple" or "pouring coffee".
We develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states.
To cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module supervised by a small number of annotated still images.
arXiv Detail & Related papers (2022-03-22T11:45:10Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in every frame.
We then apply action recognition models to learn the temporal information from video frames, using both the HIE dataset and new data with diverse scenes collected from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly related to the scene where the action happens, which can cause a model to degrade to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
- Spatiotemporal Action Recognition in Restaurant Videos [0.9176056742068814]
We analyze video footage of restaurant workers preparing food, for which potential applications include automated checkout and inventory management.
Such videos are quite different from the standardized datasets that researchers are used to, as they involve small objects, rapid actions, and notoriously unbalanced data classes.
In our first approach, we design and implement a novel recurrent modification of YOLO using convolutional LSTMs (a generic ConvLSTM cell is sketched after this entry) and explore the various subtleties in training such a network.
In the second, we study the ability of YOWO's three-dimensional convolutions to capture the unique features of our dataset.
arXiv Detail & Related papers (2020-08-25T16:30:01Z)
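The last entry mentions a recurrent modification of YOLO built from convolutional LSTMs. For reference, a minimal ConvLSTM cell looks roughly like the sketch below; this is a generic implementation (channel sizes and the single-convolution gate layout are illustrative), not the code from that paper. The cell replaces the matrix multiplications of a standard LSTM with convolutions so that hidden states keep their spatial layout:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the four LSTM gates are computed with 2D convolutions."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # a single convolution produces all four gates (input, forget, output, candidate)
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                     # hidden/cell maps: (N, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g                                # update the cell state
        h = o * torch.tanh(c)                            # hidden state keeps its spatial layout
        return h, c

# Example: run the cell over a short sequence of feature maps
cell = ConvLSTMCell(in_ch=64, hid_ch=64)
seq = torch.randn(8, 2, 64, 28, 28)                     # (T, N, C, H, W)
h = torch.zeros(2, 64, 28, 28)
c = torch.zeros_like(h)
for x in seq:                                            # iterate over time steps
    h, c = cell(x, (h, c))
```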