Action tube generation by person query matching for spatio-temporal action detection
- URL: http://arxiv.org/abs/2503.12969v1
- Date: Mon, 17 Mar 2025 09:26:06 GMT
- Title: Action tube generation by person query matching for spatio-temporal action detection
- Authors: Kazuki Omi, Jion Oshima, Toru Tamaki
- Abstract summary: The method generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.
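As a rough, hedged illustration of the linking idea in the abstract (not the authors' implementation), the sketch below greedily links per-frame DETR person queries into tubes by cosine similarity of their query embeddings. It assumes a QMM-style metric-learning stage has already pulled same-person queries together; the threshold, the greedy assignment, and all names are illustrative (a real system would enforce one-to-one matching, e.g. Hungarian assignment).

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of query embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def link_queries(frame_queries, sim_threshold=0.5):
    """frame_queries: list of (num_queries, dim) arrays, one per frame.
    Returns tubes as lists of (frame_index, query_index) pairs."""
    tubes = [[(0, q)] for q in range(frame_queries[0].shape[0])]
    for t in range(1, len(frame_queries)):
        sim = cosine_sim(frame_queries[t - 1], frame_queries[t])
        for tube in tubes:
            last_t, last_q = tube[-1]
            if last_t != t - 1:
                continue  # this tube was not continued in the previous frame
            best = int(np.argmax(sim[last_q]))
            if sim[last_q, best] > sim_threshold:
                tube.append((t, best))  # link the same person across frames
    return tubes

# toy usage: 3 frames, 4 queries each, 16-dim embeddings
rng = np.random.default_rng(0)
tubes = link_queries([rng.normal(size=(4, 16)) for _ in range(3)])
```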
Related papers
- A Flexible and Scalable Framework for Video Moment Search [51.47907684209207]
This paper introduces a flexible framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query. Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time.
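A toy, self-contained sketch of what a three-stage segment-retrieval / proposal-generation / re-ranking pipeline could look like; plain dot-product scores and fixed windows stand in for SPR's learned models, and every name here is hypothetical.

```python
import numpy as np

def retrieve_segments(query_vec, videos, window=16, top_k=5):
    """Stage 1: score fixed-length windows of every video against the query."""
    candidates = []
    for vid_id, frames in videos.items():              # frames: (T, dim) array
        for start in range(0, len(frames) - window + 1, window):
            seg = frames[start:start + window]
            score = float(seg.mean(axis=0) @ query_vec)
            candidates.append((score, vid_id, start, start + window))
    return sorted(candidates, reverse=True)[:top_k]

def refine_and_rerank(query_vec, videos, segments, shrink=4):
    """Stages 2-3: generate tighter proposals inside each segment and re-rank."""
    proposals = []
    for _, vid_id, s, e in segments:
        for ps in range(s, e - shrink + 1, shrink):        # proposal generation
            clip = videos[vid_id][ps:ps + shrink]
            score = float(clip.mean(axis=0) @ query_vec)   # refinement score
            proposals.append((score, vid_id, ps, ps + shrink))
    return sorted(proposals, reverse=True)                 # re-ranking

# toy usage: 2 videos of random frame embeddings, a random text-query embedding
rng = np.random.default_rng(1)
videos = {"v1": rng.normal(size=(64, 32)), "v2": rng.normal(size=(128, 32))}
query_vec = rng.normal(size=32)
top_moments = refine_and_rerank(query_vec, videos, retrieve_segments(query_vec, videos))
```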
arXiv Detail & Related papers (2025-01-09T08:54:19Z) - Query matching for spatio-temporal action detection with query-based object detector [0.0]
We propose a method that extends the query-based object detection model, DETR, to maintain temporal consistency in videos.
Our method applies DETR to each frame and uses feature shift to incorporate temporal information.
However, feature shift alone is ineffective when queries in different frames do not correspond to the same object. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift.
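A minimal sketch (my reading, not the paper's code) of matching queries before shifting features: each current-frame query is matched to its most similar previous-frame query, and a TSM-style channel shift then borrows features from the matched query rather than from the query at the same slot index. Names, the shift fraction, and the similarity choice are assumptions.

```python
import numpy as np

def match_queries(prev_q, curr_q):
    """Match each current query to its most similar previous-frame query."""
    prev_n = prev_q / np.linalg.norm(prev_q, axis=1, keepdims=True)
    curr_n = curr_q / np.linalg.norm(curr_q, axis=1, keepdims=True)
    return np.argmax(curr_n @ prev_n.T, axis=1)   # index of matched prev query

def matched_feature_shift(prev_q, curr_q, shift_frac=0.125):
    """Channel-shift a fraction of each current query's features from the
    matched previous-frame query instead of the same slot index."""
    matched = match_queries(prev_q, curr_q)
    k = int(curr_q.shape[1] * shift_frac)
    out = curr_q.copy()
    out[:, :k] = prev_q[matched, :k]
    return out

# toy usage: 10 queries of dimension 8 in two consecutive frames
rng = np.random.default_rng(2)
prev_q, curr_q = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
shifted = matched_feature_shift(prev_q, curr_q)
```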
arXiv Detail & Related papers (2024-09-27T02:54:24Z) - TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression [25.180317527112372]
Normalized coordinate expression is a key factor in the reliance on hand-crafted components of query-based detectors for temporal action detection (TAD).
We propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression.
Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
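A small numeric illustration (mine, assuming the usual definition of normalized segment coordinates) of why a time-aligned expression can matter: the same normalized prediction maps to very different absolute durations for videos of different lengths.

```python
def normalized_to_seconds(segment, video_len_s):
    """Convert a (start, end) pair in [0, 1] into absolute seconds."""
    start, end = segment
    return start * video_len_s, end * video_len_s

seg = (0.40, 0.45)                           # the same normalized prediction
print(normalized_to_seconds(seg, 30.0))      # (12.0, 13.5)   -> a 1.5 s action
print(normalized_to_seconds(seg, 1800.0))    # (720.0, 810.0) -> a 90 s action
```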
arXiv Detail & Related papers (2024-04-03T02:16:30Z) - Single-Stage Visual Query Localization in Egocentric Videos [79.71065005161566]
We propose a single-stage VQL framework that is end-to-end trainable.
We establish the query-video relationship by considering query-to-frame correspondences between the query and each video frame.
Our experiments demonstrate that our approach outperforms prior VQL methods by 20% accuracy while obtaining a 10x improvement in inference speed.
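A hedged sketch of query-to-frame correspondence: cosine similarity between a visual-query embedding and each frame's spatial features gives, per frame, a best-matching location and its score. This is only a stand-in for the paper's single-stage VQL model; all shapes and names are illustrative.

```python
import numpy as np

def query_frame_scores(query_feat, frame_feats):
    """query_feat: (dim,) visual-query embedding.
    frame_feats: (T, H*W, dim) spatial features per frame.
    Returns, per frame, the best-matching spatial location and its score."""
    q = query_feat / np.linalg.norm(query_feat)
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    sim = f @ q                                # (T, H*W) correspondence maps
    return sim.argmax(axis=1), sim.max(axis=1)

# toy usage: 8 frames with a 7x7 feature grid of 32-dim features
rng = np.random.default_rng(3)
locs, scores = query_frame_scores(rng.normal(size=32), rng.normal(size=(8, 49, 32)))
```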
arXiv Detail & Related papers (2023-06-15T17:57:28Z) - HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
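A minimal stand-in for temporal set matching, assuming videos are represented as sets of frame embeddings: a bidirectional mean-of-best-match score compares a query video to each support video, and the class with the highest average score wins. The exact metric and all names are my simplification, not HyRSM++'s.

```python
import numpy as np

def set_match_score(query_frames, support_frames):
    """Bidirectional mean-of-best-match score between two videos treated as
    sets of frame embeddings."""
    qn = query_frames / np.linalg.norm(query_frames, axis=1, keepdims=True)
    sn = support_frames / np.linalg.norm(support_frames, axis=1, keepdims=True)
    sim = qn @ sn.T                              # (Tq, Ts) frame similarities
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

def classify(query_frames, support_set):
    """support_set: dict class_name -> list of (T, dim) support videos."""
    scores = {c: np.mean([set_match_score(query_frames, v) for v in vids])
              for c, vids in support_set.items()}
    return max(scores, key=scores.get)

# toy usage: 5-shot episode with two classes, 16-dim frame embeddings
rng = np.random.default_rng(4)
support = {"jump": [rng.normal(size=(8, 16)) for _ in range(5)],
           "run":  [rng.normal(size=(8, 16)) for _ in range(5)]}
pred = classify(rng.normal(size=(12, 16)), support)
```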
arXiv Detail & Related papers (2023-01-09T13:32:50Z) - Enhanced Training of Query-Based Object Detection via Selective Query Recollection [35.3219210570517]
This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage.
We design and present Selective Query Recollection, a simple and effective training strategy for query-based object detectors.
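A schematic training loop (my paraphrase, not the paper's code) of the recollection idea: each decoder stage is additionally run on, and supervised with, queries recollected from an earlier point of the decoding path. Here only the initial queries are recollected, which simplifies the selective scheme; `stages` and `loss_fn` are hypothetical stand-ins.

```python
import numpy as np

def forward_with_recollection(stages, init_queries, features, targets, loss_fn):
    """stages: list of decoder-stage callables taking (queries, features)."""
    total_loss = 0.0
    chained = init_queries              # queries refined stage by stage, as usual
    for stage in stages:
        # each stage sees both its usual input and recollected earlier queries
        outputs = [stage(q, features) for q in (chained, init_queries)]
        total_loss += sum(loss_fn(o, targets) for o in outputs)
        chained = outputs[0]            # continue the normal refinement path
    return total_loss

# toy usage with identity-like "stages" and a simple L2 loss
stages = [lambda q, f: q + 0.1 * f for _ in range(3)]
loss = forward_with_recollection(stages, np.zeros(4), np.ones(4), np.ones(4),
                                 lambda o, t: float(((o - t) ** 2).sum()))
```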
arXiv Detail & Related papers (2022-12-15T02:45:57Z) - Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components.
The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and across videos in an episode.
We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
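A crude stand-in for the hybrid relation idea, assuming the episode's videos are given as frame embeddings: within-video pooling followed by cross-video attention yields task-specific embeddings. The real module is more elaborate; everything here is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_relation(frame_embs):
    """frame_embs: (N, T, dim) frame embeddings for the N videos in an episode.
    Intra-relation: within-video pooling; cross-relation: attention over the
    other videos in the episode, with a residual connection."""
    intra = frame_embs.mean(axis=1)          # (N, dim) within-video pooling
    attn = softmax(intra @ intra.T)          # (N, N) cross-video relations
    return intra + attn @ intra              # task-specific embeddings

# toy usage: an episode of 6 videos, 8 frames each, 16-dim features
refined = hybrid_relation(np.random.default_rng(5).normal(size=(6, 8, 16)))
```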
arXiv Detail & Related papers (2022-04-28T11:43:41Z) - Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast fine-grained video understanding as a query-response mechanism, where each query addresses a particular question and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
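A hedged sketch of such a query-response mechanism: learnable query vectors attend over the video's temporal features, and each query's attended evidence is classified against its own response label set. Shapes and the toy label sets are assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_response(video_feats, queries, heads):
    """video_feats: (T, dim); queries: (Q, dim) learnable vectors, one per
    'question'; heads: list of (dim, num_labels_q) matrices, one label set per
    query. Returns one response distribution per query."""
    attn = softmax(queries @ video_feats.T)            # (Q, T) temporal attention
    responses = attn @ video_feats                     # (Q, dim) per-query evidence
    return [softmax(responses[i] @ heads[i]) for i in range(len(heads))]

# toy usage: 2 queries with label sets of different sizes (3 and 5 labels)
rng = np.random.default_rng(6)
probs = query_response(rng.normal(size=(16, 32)), rng.normal(size=(2, 32)),
                       [rng.normal(size=(32, 3)), rng.normal(size=(32, 5))])
```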
arXiv Detail & Related papers (2021-04-19T17:58:48Z) - Temporal-Relational CrossTransformers for Few-Shot Action Recognition [82.0033565755246]
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frames between the query and videos in the support set.
Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos.
A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order CrossTransformers.
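A simplified sketch of building a query-specific class prototype by attending from a query sub-sequence representation to the support sub-sequence representations of a class; the real CrossTransformer uses learned key/value projections, which this stand-in omits.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crosstransformer_prototype(query_rep, support_subseqs):
    """query_rep: (dim,) one query sub-sequence representation.
    support_subseqs: (M, dim) sub-sequence representations pooled over all
    support videos of one class. Returns a query-specific class prototype."""
    attn = softmax(support_subseqs @ query_rep)        # (M,) relevance weights
    return attn @ support_subseqs                      # attention-weighted prototype

def classify(query_rep, class_to_subseqs):
    """Pick the class whose query-specific prototype is closest to the query."""
    dists = {c: np.linalg.norm(query_rep - crosstransformer_prototype(query_rep, s))
             for c, s in class_to_subseqs.items()}
    return min(dists, key=dists.get)

# toy usage: two classes with 10 support sub-sequences each
rng = np.random.default_rng(7)
pred = classify(rng.normal(size=16),
                {"dive": rng.normal(size=(10, 16)), "twist": rng.normal(size=(10, 16))})
```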
arXiv Detail & Related papers (2021-01-15T15:47:35Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling between the textual query and video contents.
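A toy sketch of frame-wise cross-modal matching, assuming a sentence embedding and per-frame embeddings in a shared space: per-frame relevance scores are computed first, and a crude span picker stands in for the learned boundary predictor.

```python
import numpy as np

def framewise_relevance(text_emb, frame_embs):
    """Cosine relevance between a sentence embedding and every frame."""
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return f @ t                                      # (T,) per-frame scores

def predict_boundaries(relevance, threshold=0.0):
    """Pick the highest-scoring contiguous span of above-threshold frames."""
    best, cur_start, cur_sum = (0, 0, -np.inf), None, 0.0
    for i, r in enumerate(relevance):
        if r > threshold:
            cur_start = i if cur_start is None else cur_start
            cur_sum += r
            if cur_sum > best[2]:
                best = (cur_start, i, cur_sum)
        else:
            cur_start, cur_sum = None, 0.0
    return best[0], best[1]                           # (start_frame, end_frame)

# toy usage: a 20-frame video and a 32-dim query embedding
rng = np.random.default_rng(8)
rel = framewise_relevance(rng.normal(size=32), rng.normal(size=(20, 32)))
start, end = predict_boundaries(rel)
```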
arXiv Detail & Related papers (2020-09-22T10:25:41Z)