MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization
- URL: http://arxiv.org/abs/2204.02688v2
- Date: Wed, 27 Nov 2024 06:14:47 GMT
- Title: MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization
- Authors: Shimin Chen, Wei Li, Chen Chen, Jianyang Gu, Jiaming Chu, Xunqiang Tao, Yandong Guo
- Abstract summary: We are the first to propose a new benchmark for multi-person complex activity localization. We observe that limited atomic actions can be combined into many complex activities. MM-SEAL provides both atomic action and complex activity annotations, producing 111.7k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories.
- Score: 19.721688276051363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a novel large-scale video dataset dubbed MM-SEAL for multi-person multi-grained spatio-temporal action localization in human daily life. We are the first to propose a new benchmark for multi-person spatio-temporal complex activity localization, where complex semantics and long durations bring new challenges to localization tasks. We observe that a limited set of atomic actions can be combined into many complex activities. MM-SEAL provides both atomic action and complex activity annotations, producing 111.7k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories. We explore the relationship between atomic actions and complex activities, finding that atomic action features can improve complex activity localization performance. We also propose a new network, termed Faster-TAD, which generates temporal proposals and labels simultaneously. Finally, our evaluations show that visual features pretrained on MM-SEAL can improve performance on other action localization benchmarks. We will release the dataset and the project code upon publication of the paper.
Related papers
- RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba [86.47790050206306]
RefAVA++ comprises >2.9 million frames and >75.1k annotated persons in total. RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism. Experiments show that RefAtomNet++ establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-10-18T10:41:19Z) - ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset [11.07193206318681]
We introduce the Aerial Traffic Atomic Activity Recognition and Temporal Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis.
We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities.
We propose a novel task, Multi-label trimmed Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity.
arXiv Detail & Related papers (2025-03-24T11:06:04Z) - Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z) - Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings totaling 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z) - Temporal Grounding of Activities using Multimodal Large Language Models [0.0]
We evaluate the effectiveness of combining image-based and text-based large language models (LLMs) in a two-stage approach for temporal activity localization.
We demonstrate that our method outperforms existing video-based LLMs.
arXiv Detail & Related papers (2024-05-30T09:11:02Z) - Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z) - Off-Beat Multi-Agent Reinforcement Learning [62.833358249873704]
We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent.
We propose a novel episodic memory, LeGEM, for model-free MARL algorithms.
We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2022-05-27T02:21:04Z) - Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z) - Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments taking only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z) - TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos [45.025522742972505]
This paper summarizes the TinyAction challenge, which was organized at the ActivityNet workshop at CVPR 2021.
This challenge focuses on recognizing real-world low-resolution activities present in videos.
arXiv Detail & Related papers (2021-07-24T00:41:19Z) - JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection [54.696819174421584]
We introduce JRDB-Act, a multi-modal dataset that reflects a real distribution of human daily life actions in a university campus environment.
JRDB-Act has been densely annotated with atomic actions and comprises over 2.8M action labels.
JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene.
arXiv Detail & Related papers (2021-06-16T14:43:46Z) - FineAction: A Fined Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fined video dataset collected from existing video datasets and web videos.
This dataset contains 139K fined action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that our FineAction brings new challenges for action localization on fined and multi-label instances with shorter duration.
arXiv Detail & Related papers (2021-05-24T06:06:32Z) - Semi-Supervised Few-Shot Atomic Action Recognition [59.587738451616495]
We propose a novel model for semi-supervised few-shot atomic action recognition.
Our model features unsupervised and contrastive video embedding, loose action alignment, multi-head feature comparison, and attention-based aggregation.
Experiments show that our model attains high accuracy on representative atomic action datasets, outperforming the respective state-of-the-art classification accuracy in the fully supervised setting.
arXiv Detail & Related papers (2020-11-17T03:59:05Z) - Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z) - Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization [40.517438760096056]
Temporally localizing activities within untrimmed videos has been extensively studied in recent years.
Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring.
arXiv Detail & Related papers (2020-07-13T19:33:24Z) - Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
arXiv Detail & Related papers (2020-04-28T00:15:26Z) - Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos [72.50607929306058]
We propose a real-time online system to perform activity detection on untrimmed security videos.
The proposed method consists of three stages: tubelet extraction, activity classification and online tubelet merging.
We demonstrate the effectiveness of the proposed approach in terms of speed (100 fps) and performance with state-of-the-art results.
arXiv Detail & Related papers (2020-04-23T22:20:10Z) - Spatio-Temporal Action Detection with Multi-Object Interaction [127.85524354900494]
In this paper, we study the spatio-temporal action detection problem with multi-object interaction.
We introduce a new dataset that is spatially annotated with action tubes containing multi-object interactions.
We propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
arXiv Detail & Related papers (2020-04-01T00:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.