Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- URL: http://arxiv.org/abs/2002.03266v1
- Date: Sun, 9 Feb 2020 02:17:46 GMT
- Title: Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- Authors: Junnan Li, Jianquan Liu, Yongkang Wong, Shoji Nishimura, Mohan
Kankanhalli
- Abstract summary: We address the problem of action recognition in top-view 360$^{\circ}$ videos.
The proposed framework first transforms omnidirectional videos into panoramic videos, then it extracts spatial-temporal features using region-based 3D CNNs for action recognition.
We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision.
- Score: 24.4517195084202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent development of commodity 360$^{\circ}$ cameras has enabled a
single video to capture an entire scene, which holds promising potential in
surveillance scenarios. However, research in omnidirectional video analysis has
lagged behind the hardware advances. In this work, we address the important
problem of action recognition in top-view 360$^{\circ}$ videos. Due to the wide
field-of-view, 360$^{\circ}$ videos usually capture multiple people performing
actions at the same time. Furthermore, the appearance of people is deformed.
The proposed framework first transforms omnidirectional videos into panoramic
videos, then it extracts spatial-temporal features using region-based 3D CNNs
for action recognition. We propose a weakly-supervised method based on
multi-instance multi-label learning, which trains the model to recognize and
localize multiple actions in a video using only video-level action labels as
supervision. We perform experiments to quantitatively validate the efficacy of
the proposed method and qualitatively demonstrate action localization results.
To enable research in this direction, we introduce 360Action, the first
omnidirectional video dataset for multi-person action recognition.
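The multi-instance multi-label (MIML) idea in the abstract can be sketched concretely: each person region is an instance, the video carries only a set of action labels, and per-region scores are aggregated into a video-level prediction that the labels can supervise. The sketch below is an illustrative minimal version, assuming per-region logits are already extracted (e.g. by a region-based 3D CNN) and using max-pooling over instances; it is not the authors' implementation, and all names and shapes are hypothetical.

```python
import numpy as np

def miml_video_loss(region_logits: np.ndarray, video_labels: np.ndarray) -> float:
    """Aggregate per-region (instance) logits into a video-level prediction
    and score it against video-level multi-label supervision.

    region_logits: (num_regions, num_classes) raw scores, one row per person region
    video_labels:  (num_classes,) binary vector of actions present in the video
    """
    # Max-pool over instances: a class counts as present in the video if at
    # least one region strongly predicts it (the standard MIL assumption).
    video_logits = region_logits.max(axis=0)
    probs = 1.0 / (1.0 + np.exp(-video_logits))  # per-class sigmoid
    eps = 1e-7
    # Binary cross-entropy over classes (multi-label, so no softmax).
    bce = -(video_labels * np.log(probs + eps)
            + (1 - video_labels) * np.log(1 - probs + eps))
    return float(bce.mean())

# Example: 3 person regions, 4 action classes; classes 0 and 2 are present.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
labels = np.array([1, 0, 1, 0])
loss = miml_video_loss(logits, labels)
```

Because only the max-scoring region receives gradient for each class, training this objective implicitly assigns actions to regions, which is one way the model can localize actions without region-level labels.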
Related papers
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction
Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z) - Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos with only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z) - ChatVideo: A Tracklet-centric Multimodal and Versatile Video
Understanding System [119.51012668709502]
We present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo.
Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit.
All the detected tracklets are stored in a database and interact with the user through a database manager.
arXiv Detail & Related papers (2023-04-27T17:59:58Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - People Tracking in Panoramic Video for Guiding Robots [2.092922495279074]
A guiding robot aims to effectively bring people to and from specific places within environments that are possibly unknown to them.
During this operation the robot should be able to detect and track the accompanied person, trying never to lose sight of her/him.
A solution to minimize this event is to use an omnidirectional camera: its 360$^{\circ}$ Field of View (FoV) guarantees that any framed object cannot leave the FoV unless occluded or very far from the sensor.
We propose a set of targeted methods that effectively adapt a standard people detection and tracking pipeline, originally designed for perspective cameras, to panoramic videos.
arXiv Detail & Related papers (2022-06-06T16:44:38Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z) - Gabriella: An Online System for Real-Time Activity Detection in
Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.