Action Selection Learning for Multi-label Multi-view Action Recognition
- URL: http://arxiv.org/abs/2410.03302v3
- Date: Fri, 18 Oct 2024 00:46:30 GMT
- Title: Action Selection Learning for Multi-label Multi-view Action Recognition
- Authors: Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide
- Abstract summary: This study focuses on real-world scenarios where cameras are distributed to capture a wide area, with only weak labels available at the video level.
We propose Multi-view Action Selection Learning (MultiASL), a method that leverages action selection learning to enhance view fusion.
Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.
- Score: 2.8266810371534152
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame level. This study focuses on real-world scenarios where cameras are distributed to capture a wide area, with only weak labels available at the video level. We propose Multi-view Action Selection Learning (MultiASL), a method that leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder that extracts spatial and temporal features from multi-viewpoint videos. Action selection learning is applied at the frame level, using pseudo ground truth derived from the weak video-level labels, to identify the frames most relevant for action recognition. Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods. The source code is available at https://github.com/thanhhff/MultiASL/.
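For readers who prefer code, the following is a minimal PyTorch sketch of the pipeline the abstract describes: per-view frame features pass through a shared temporal Transformer, views are fused, and video-level predictions come from aggregating the highest-scoring frames, mimicking frame selection under weak video-level labels. This is not the authors' implementation (see the GitHub link above for that); the spatial backbone is omitted, and the max-pooling view fusion, top-k frame aggregation, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of the idea in the abstract, not the authors' implementation
# (the official code is at https://github.com/thanhhff/MultiASL/). Layer sizes,
# the element-wise max view fusion, and the top-k frame aggregation used to
# derive video-level predictions are illustrative assumptions.
import torch
import torch.nn as nn

class MultiViewActionSketch(nn.Module):
    def __init__(self, feat_dim=512, num_classes=12, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.frame_head = nn.Linear(feat_dim, num_classes)  # frame-level action scores

    def forward(self, x):
        # x: (batch, views, frames, feat_dim) pre-extracted per-frame features.
        b, v, t, d = x.shape
        z = self.temporal(x.reshape(b * v, t, d)).reshape(b, v, t, d)
        z = z.max(dim=1).values               # fuse viewpoints (assumed max fusion)
        frame_logits = self.frame_head(z)     # (batch, frames, num_classes)
        # Weak video-level supervision: average the top-k frame scores per class,
        # a stand-in for selecting the most relevant frames.
        k = max(1, t // 8)
        video_logits = frame_logits.topk(k, dim=1).values.mean(dim=1)
        return frame_logits, video_logits

# Toy usage: 2 videos, 4 viewpoints, 32 frames, 512-dim features, 12 action classes.
model = MultiViewActionSketch()
frame_logits, video_logits = model(torch.randn(2, 4, 32, 512))
loss = nn.BCEWithLogitsLoss()(video_logits, torch.zeros(2, 12))  # multi-label targets
```

In a training setup along the lines of the abstract, the video-level loss above would be complemented by a frame-level loss on pseudo labels derived from the weak video-level annotations.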
Related papers
- Matching Anything by Segmenting Anything [109.2507425045143]
We propose MASA, a novel method for robust instance association learning.
MASA learns instance-level correspondence through exhaustive data transformations.
We show that MASA achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences.
arXiv Detail & Related papers (2024-06-06T16:20:07Z) - Hypergraph-based Multi-View Action Recognition using Event Cameras [20.965606424362726]
We introduce HyperMV, a multi-view event-based action recognition framework.
We present the largest multi-view event-based action dataset $\text{THU}^{\text{MV-EACT}}\text{-50}$, comprising 50 actions from 6 viewpoints.
Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios.
arXiv Detail & Related papers (2024-03-28T11:17:00Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a rough sketch of this idea appears after the related-papers list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking
and Segmentation [31.167405688707575]
We propose a framework for instance-level visual analysis on video frames.
It can simultaneously conduct object detection, instance segmentation, and multi-object tracking.
We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets.
arXiv Detail & Related papers (2023-11-02T04:32:24Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - PointTAD: Multi-Label Temporal Action Detection with Learnable Query
Points [28.607690605262878]
Temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label.
In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video.
We extend the sparse query-based detection paradigm from traditional TAD and propose PointTAD, a multi-label TAD framework.
arXiv Detail & Related papers (2022-10-20T06:08:03Z) - BoxMask: Revisiting Bounding Box Supervision for Video Object Detection [11.255962936937744]
We propose BoxMask, which learns discriminative representations by incorporating class-aware pixel-level information.
The proposed module can be effortlessly integrated into any region-based detector to boost detection performance.
arXiv Detail & Related papers (2022-10-12T08:25:27Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly
Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z) - Frame Aggregation and Multi-Modal Fusion Framework for Video-Based
Person Recognition [13.875674649636874]
We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and incorporates them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
arXiv Detail & Related papers (2020-10-19T08:06:40Z)
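The VaQuitA entry above mentions replacing uniform frame sampling with CLIP-score-guided sampling. Below is a rough, self-contained sketch of that general idea using the Hugging Face CLIP model; the checkpoint, query text, and top-k selection are assumptions and not taken from the paper.

```python
# Rough sketch of CLIP-score-guided frame selection (the general idea behind the
# VaQuitA summary above); the checkpoint, query, and k are assumptions, not the paper's.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames, query, k=8):
    """Rank frames by CLIP image-text similarity and keep the top k, in temporal order."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame
    keep = scores.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep.tolist()]

# Toy usage with blank frames; in practice these would be decoded video frames.
frames = [Image.new("RGB", (224, 224)) for _ in range(32)]
selected = select_frames(frames, "a person giving a presentation", k=8)
```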
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.