Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- URL: http://arxiv.org/abs/2002.03266v1
- Date: Sun, 9 Feb 2020 02:17:46 GMT
- Title: Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$
Videos
- Authors: Junnan Li, Jianquan Liu, Yongkang Wong, Shoji Nishimura, Mohan
Kankanhalli
- Abstract summary: We address the problem of action recognition in top-view 360$^{\circ}$ videos.
The proposed framework first transforms omnidirectional videos into panoramic videos, then it extracts spatial-temporal features using region-based 3D CNNs for action recognition.
We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision.
- Score: 24.4517195084202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent development of commodity 360$^{\circ}$ cameras has enabled a
single video to capture an entire scene, which holds promising potential in
surveillance scenarios. However, research in omnidirectional video analysis has
lagged behind the hardware advances. In this work, we address the important
problem of action recognition in top-view 360$^{\circ}$ videos. Due to the wide
field-of-view, 360$^{\circ}$ videos usually capture multiple people performing
actions at the same time. Furthermore, the appearance of people is deformed.
The proposed framework first transforms omnidirectional videos into panoramic
videos, then it extracts spatial-temporal features using region-based 3D CNNs
for action recognition. We propose a weakly-supervised method based on
multi-instance multi-label learning, which trains the model to recognize and
localize multiple actions in a video using only video-level action labels as
supervision. We perform experiments to quantitatively validate the efficacy of
the proposed method and qualitatively demonstrate action localization results.
To enable research in this direction, we introduce 360Action, the first
omnidirectional video dataset for multi-person action recognition.
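The multi-instance multi-label (MIML) idea in the abstract can be sketched concretely: each person region is an instance, the video carries only a set of action labels, and per-region scores are aggregated into a video-level prediction that the labels can supervise. The sketch below is an illustrative minimal version, assuming per-region logits are already extracted (e.g. by a region-based 3D CNN) and using max-pooling over instances; it is not the authors' implementation, and all names and shapes are hypothetical.

```python
import numpy as np

def miml_video_loss(region_logits: np.ndarray, video_labels: np.ndarray) -> float:
    """Aggregate per-region (instance) logits into a video-level prediction
    and score it against video-level multi-label supervision.

    region_logits: (num_regions, num_classes) raw scores, one row per person region
    video_labels:  (num_classes,) binary vector of actions present in the video
    """
    # Max-pool over instances: a class counts as present in the video if at
    # least one region strongly predicts it (the standard MIL assumption).
    video_logits = region_logits.max(axis=0)
    probs = 1.0 / (1.0 + np.exp(-video_logits))  # per-class sigmoid
    eps = 1e-7
    # Binary cross-entropy over classes (multi-label, so no softmax).
    bce = -(video_labels * np.log(probs + eps)
            + (1 - video_labels) * np.log(1 - probs + eps))
    return float(bce.mean())

# Example: 3 person regions, 4 action classes; classes 0 and 2 are present.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
labels = np.array([1, 0, 1, 0])
loss = miml_video_loss(logits, labels)
```

Because only the max-scoring region receives gradient for each class, training this objective implicitly assigns actions to regions, which is one way the model can localize actions without region-level labels.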
Related papers
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction
Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z) - Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos with only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z) - ChatVideo: A Tracklet-centric Multimodal and Versatile Video
Understanding System [119.51012668709502]
We present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo.
Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit.
All the detected tracklets are stored in a database and interact with the user through a database manager.
arXiv Detail & Related papers (2023-04-27T17:59:58Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - People Tracking in Panoramic Video for Guiding Robots [2.092922495279074]
A guiding robot aims to effectively bring people to and from specific places within environments that are possibly unknown to them.
During this operation the robot should be able to detect and track the accompanied person, trying never to lose sight of her/him.
A solution to minimize this event is to use an omnidirectional camera: its 360$^{\circ}$ Field of View (FoV) guarantees that any framed object cannot leave the FoV unless occluded or very far from the sensor.
We propose a set of targeted methods that effectively adapt a standard people detection and tracking pipeline, originally designed for perspective cameras, to panoramic videos.
arXiv Detail & Related papers (2022-06-06T16:44:38Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - E^2TAD: An Energy-Efficient Tracking-based Action Detector [78.90585878925545]
This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions.
It won first place in the UAV-Video Track of the 2021 Low-Power Computer Vision Challenge (LPCVC).
arXiv Detail & Related papers (2022-04-09T07:52:11Z) - Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z) - A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z) - Gabriella: An Online System for Real-Time Activity Detection in
Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.