ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing
- URL: http://arxiv.org/abs/2503.12852v1
- Date: Mon, 17 Mar 2025 06:12:36 GMT
- Title: ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing
- Authors: Aditi Tiwari, Klara Nahrstedt
- Abstract summary: We introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions.
- Score: 3.781421673607643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective training and debriefing are critical in high-stakes, mission-critical environments such as disaster response, military simulations, and industrial safety, where precision and error minimization are paramount. Traditional post-training analysis relies on manually reviewing 2D videos, a time-consuming process that lacks comprehensive situational awareness. To address these limitations, we introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. To enable deployment in resource-constrained environments, we apply quantization and model pruning, reducing the model size by 74% while maintaining robust accuracy (mAP drop of only 1.5%, from 0.865 to 0.850) and improving inference speed. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions, recorded across various real-world training sessions and environmental conditions. Additionally, ACT360 integrates 360AIE (Action Insight Explorer), a web-based interface for automatic action detection, retrieval, and textual summarization using large language models (LLMs), significantly enhancing post-incident analysis efficiency. ACT360 serves as a generalized framework for mission-critical debriefing, incorporating EAC, spatial attention, summarization, and model optimization. These innovations apply to any training environment requiring lightweight action detection and structured post-exercise analysis.
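To make the compression step concrete, here is a minimal PyTorch sketch of the generic prune-then-quantize recipe the abstract describes. The toy backbone, layer choices, and pruning ratio are illustrative assumptions, not the released 360YOWO code.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def compress(model: nn.Module, prune_amount: float = 0.5) -> nn.Module:
    # 1) L1-magnitude pruning of convolutional weights.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=prune_amount)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    # 2) Dynamic int8 quantization of the remaining dense layers.
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Stand-in backbone; the real 360YOWO model is not reproduced here.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 7),  # seven operational action classes, as in the dataset
)
compressed = compress(backbone)
print(compressed)
```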
Related papers
- Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images [2.2124795371148616]
We evaluate vision transformers pre-trained with masked image modelling (MIM) against synthetic out-of-distribution (OOD) benchmarks.
Experiments demonstrate BEiT's robustness: it maintains 94% accuracy on PACS and 87% on Office-Home despite significant occlusions.
These insights bridge the gap between lab-trained models and real-world deployment and offer a blueprint for building AI systems that generalise reliably under uncertainty.
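As a rough illustration of that occlusion stress test (not the paper's exact protocol), the sketch below loads a MIM-pretrained BEiT from timm, applies large random erasures, and measures accuracy; the model name, erasure sizes, and dataloader are assumptions.

```python
import timm
import torch
from torchvision import transforms

model = timm.create_model("beit_base_patch16_224", pretrained=True).eval()
occlude = transforms.RandomErasing(p=1.0, scale=(0.25, 0.40))  # large occlusions

@torch.no_grad()
def occluded_accuracy(dataloader):
    correct = total = 0
    for images, labels in dataloader:  # images: (B, 3, 224, 224), already normalized
        images = torch.stack([occlude(img) for img in images])  # occlude each image
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```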
arXiv Detail & Related papers (2025-04-05T16:25:34Z) - LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction [12.071846486955627]
We introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance, and (3) an Instance-Enhanced BEV. Our method achieves state-of-the-art performance, outperforming existing approaches.
arXiv Detail & Related papers (2025-03-10T14:26:21Z) - Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO. This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors. Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by x2.7 and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by x2.6 on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z) - INTACT: Inducing Noise Tolerance through Adversarial Curriculum Training for LiDAR-based Safety-Critical Perception and Autonomy [0.4124847249415279]
We present a novel framework designed to enhance the robustness of deep neural networks (DNNs) against noisy LiDAR data. INTACT combines meta-learning with adversarial curriculum training (ACT) to address challenges posed by data corruption and sparsity in 3D point clouds. INTACT's effectiveness is demonstrated through comprehensive evaluations on object detection, tracking, and classification benchmarks.
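One hedged reading of a corruption curriculum, shown below for intuition only: severity of LiDAR noise and point dropout grows with training progress. The rates and the schedule are arbitrary placeholders, not the paper's procedure.

```python
import torch

def corrupt_point_cloud(points: torch.Tensor, progress: float) -> torch.Tensor:
    """points: (N, 3) xyz coordinates; progress in [0, 1] over the training run."""
    noise_std = 0.02 * progress          # jitter grows as training progresses
    drop_rate = 0.30 * progress          # point dropout grows as training progresses
    noisy = points + noise_std * torch.randn_like(points)
    keep = torch.rand(points.shape[0]) >= drop_rate
    return noisy[keep]
```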
arXiv Detail & Related papers (2025-02-04T00:02:16Z) - Efficient Detection Framework Adaptation for Edge Computing: A Plug-and-play Neural Network Toolbox Enabling Edge Deployment [59.61554561979589]
Edge computing has emerged as a key paradigm for deploying deep learning-based object detection in time-sensitive scenarios.
Existing edge detection methods face challenges: difficulty balancing detection precision with lightweight models, limited adaptability, and insufficient real-world validation.
We propose the Edge Detection Toolbox (ED-TOOLBOX), which utilizes generalizable plug-and-play components to adapt object detection models for edge environments.
arXiv Detail & Related papers (2024-12-24T07:28:10Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
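A minimal sketch of the general idea, assuming a small learned head that restores the features of unsampled frames from two sampled neighbours; the architecture and dimensions are placeholders, not the authors' design.

```python
import torch
import torch.nn as nn

class FeatureRestorer(nn.Module):
    """Predict features of n_missing unsampled frames from two sampled neighbours."""
    def __init__(self, feat_dim: int, n_missing: int):
        super().__init__()
        self.n_missing = n_missing
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, n_missing * feat_dim),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, D) features of the two sampled frames -> (B, n_missing, D)
        out = self.net(torch.cat([feat_a, feat_b], dim=-1))
        return out.view(feat_a.size(0), self.n_missing, -1)
```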
arXiv Detail & Related papers (2023-07-27T13:52:42Z) - For SALE: State-Action Representation Learning for Deep Reinforcement Learning [60.42044715596703]
SALE is a novel approach for learning embeddings that model the nuanced interaction between state and action.
We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm.
On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively.
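For intuition, a hypothetical state-action embedding module in the spirit of SALE: a state encoder and a joint state-action encoder whose outputs would augment the critic's and actor's inputs. Sizes and structure are assumptions, not the TD7 reference implementation.

```python
import torch
import torch.nn as nn

class StateActionEmbedding(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, embed_dim: int = 256):
        super().__init__()
        # f(s): embed the state alone
        self.state_enc = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ELU(), nn.Linear(embed_dim, embed_dim)
        )
        # g(f(s), a): embed the state-action interaction
        self.joint_enc = nn.Sequential(
            nn.Linear(embed_dim + action_dim, embed_dim), nn.ELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        zs = self.state_enc(state)
        zsa = self.joint_enc(torch.cat([zs, action], dim=-1))
        return zs, zsa

# A critic could then take [state, action, zs, zsa] as input instead of [state, action].
```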
arXiv Detail & Related papers (2023-06-04T19:47:46Z)