Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- URL: http://arxiv.org/abs/2002.03266v1
- Date: Sun, 9 Feb 2020 02:17:46 GMT
- Title: Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- Authors: Junnan Li, Jianquan Liu, Yongkang Wong, Shoji Nishimura, Mohan
Kankanhalli
- Abstract summary: We address the problem of action recognition in top-view 360$^{\circ}$ videos.
The proposed framework first transforms omnidirectional videos into panoramic videos, then it extracts spatial-temporal features using region-based 3D CNNs for action recognition.
We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision.
- Score: 24.4517195084202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent development of commodity 360$^{\circ}$ cameras has enabled a
single video to capture an entire scene, which offers promising potential in
surveillance scenarios. However, research in omnidirectional video analysis has
lagged behind the hardware advances. In this work, we address the important
problem of action recognition in top-view 360$^{\circ}$ videos. Due to the wide
field-of-view, 360$^{\circ}$ videos usually capture multiple people performing
actions at the same time. Furthermore, the appearance of people is deformed.
The proposed framework first transforms omnidirectional videos into panoramic
videos, then it extracts spatial-temporal features using region-based 3D CNNs
for action recognition. We propose a weakly-supervised method based on
multi-instance multi-label learning, which trains the model to recognize and
localize multiple actions in a video using only video-level action labels as
supervision. We perform experiments to quantitatively validate the efficacy of
the proposed method and qualitatively demonstrate action localization results.
To enable research in this direction, we introduce 360Action, the first
omnidirectional video dataset for multi-person action recognition.
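As a rough illustration of the first stage, the sketch below unwarps a top-view omnidirectional (fisheye) frame into a panorama by polar-to-Cartesian remapping. The abstract does not specify the exact projection, so the center (cx, cy), the radius range, and the use of OpenCV's remap are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: unwarp a top-view fisheye frame into a panorama.
# Assumed parameters: (cx, cy) is the image center of the fisheye circle,
# [r_min, r_max] is the usable radius band; the paper's actual transform
# may differ.
import cv2
import numpy as np

def fisheye_to_panorama(frame, cx, cy, r_min, r_max, out_w=1024, out_h=256):
    """Map a circular top-view frame to a rectangular panorama.

    Each panorama column corresponds to an angle theta around (cx, cy);
    each row corresponds to a radius between r_min and r_max.
    """
    thetas = np.linspace(0.0, 2.0 * np.pi, out_w, endpoint=False)
    radii = np.linspace(r_min, r_max, out_h)
    theta_grid, r_grid = np.meshgrid(thetas, radii)  # shape (out_h, out_w)
    # Source pixel coordinates for each panorama pixel (x right, y down).
    map_x = (cx + r_grid * np.cos(theta_grid)).astype(np.float32)
    map_y = (cy + r_grid * np.sin(theta_grid)).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)

# Usage (hypothetical values for a 1920x1920 fisheye frame):
# pano = fisheye_to_panorama(frame, cx=960, cy=960, r_min=200, r_max=940)
```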
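The weakly-supervised objective can be pictured as follows: each detected person region gets per-class action scores, the scores are aggregated across regions into a video-level prediction, and only the video-level multi-hot labels supervise training. The PyTorch sketch below uses max-pooling as the multi-instance aggregator, a common choice in multi-instance multi-label learning; the paper's exact aggregation function may differ.

```python
# Minimal MIML sketch: train region scores with only video-level labels.
import torch
import torch.nn.functional as F

def miml_loss(region_logits, video_labels):
    """region_logits: (num_regions, num_classes) scores for person regions.
       video_labels:  (num_classes,) float multi-hot video-level labels."""
    # A class is present in the video if at least one region exhibits it,
    # so max-pool over regions to get the video-level prediction.
    video_logits, _ = region_logits.max(dim=0)
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)

# At inference, per-region sigmoid scores localize which person performs
# which action: probs = torch.sigmoid(region_logits)
```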
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
Key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z) - Action Selection Learning for Multi-label Multi-view Action Recognition [2.8266810371534152]
This study focuses on real-world scenarios where cameras are distributed to cover a wide area, with only weak labels available at the video level.
We propose the method named Multi-view Action Selection Learning (MultiASL), which leverages action selection learning to enhance view fusion.
Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.
arXiv Detail & Related papers (2024-10-04T10:36:22Z) - Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos with only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z) - ChatVideo: A Tracklet-centric Multimodal and Versatile Video
Understanding System [119.51012668709502]
We present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo.
Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit.
All the detected tracklets are stored in a database and interact with the user through a database manager.
arXiv Detail & Related papers (2023-04-27T17:59:58Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z)