Query matching for spatio-temporal action detection with query-based object detector
- URL: http://arxiv.org/abs/2409.18408v1
- Date: Fri, 27 Sep 2024 02:54:24 GMT
- Title: Query matching for spatio-temporal action detection with query-based object detector
- Authors: Shimon Hori, Kazuki Omi, Toru Tamaki
- Abstract summary: We propose a method that extends the query-based object detection model, DETR, to maintain temporal consistency in videos.
Our method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR's object queries in different frames may correspond to different objects, making a simple feature shift ineffective.
To overcome this issue, we propose query matching across frames, ensuring that queries for the same object are matched and used for the feature shift.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method that extends the query-based object detection model, DETR, to spatio-temporal action detection, which requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR's object queries in each frame may correspond to different objects, making a simple feature shift ineffective. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves significantly when query features are shifted using the proposed query matching.
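The abstract describes two steps: match object queries across frames, then shift part of each matched query's features from the previous frame. A minimal NumPy sketch of this idea follows; the similarity measure (cosine), the greedy one-to-one assignment, and the `shift_ratio` parameter are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def match_queries(q_prev, q_cur):
    """Greedily match each query of the previous frame to a distinct
    query of the current frame by cosine similarity of their features.
    (Assumed matching criterion; the paper may use a different one.)"""
    a = q_prev / np.linalg.norm(q_prev, axis=1, keepdims=True)
    b = q_cur / np.linalg.norm(q_cur, axis=1, keepdims=True)
    sim = a @ b.T                      # (N, N) pairwise similarities
    n = sim.shape[0]
    match = np.full(n, -1)
    for _ in range(n):
        # mask out rows and columns that are already assigned
        masked = sim.copy()
        masked[match >= 0, :] = -np.inf
        masked[:, match[match >= 0]] = -np.inf
        i, j = np.unravel_index(np.argmax(masked), masked.shape)
        match[i] = j
    return match                       # match[i]: index in q_cur paired with q_prev[i]

def shift_features(q_prev, q_cur, shift_ratio=0.25):
    """Temporal feature shift: align current queries to the previous
    frame via matching, then copy the first `shift_ratio` fraction of
    channels from the previous frame's queries (hypothetical ratio)."""
    aligned = q_cur[match_queries(q_prev, q_cur)]
    k = int(aligned.shape[1] * shift_ratio)
    out = aligned.copy()
    out[:, :k] = q_prev[:, :k]
    return out
```

The point of the matching step is visible in the sketch: without `match_queries`, the channel copy in `shift_features` would mix features of different objects whenever the query order differs between frames.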
Related papers
- Test-time Adaptation for Cross-modal Retrieval with Query Shift [14.219337695007207]
We propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR)
In this paper, we observe that query shift would not only diminish the uniformity (namely, within-modality scatter) of the query modality but also amplify the gap between query and gallery modalities.
arXiv Detail & Related papers (2024-10-21T04:08:19Z)
- Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu)
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
- User Intent Recognition and Semantic Cache Optimization-Based Query Processing Framework using CFLIS and MGR-LAU [0.0]
This work analyzes informational, navigational, and transactional intents in queries for enhanced query processing (QP).
For efficient QP, the data is structured using Epanechnikov Kernel-Ordering Points To Identify the Clustering Structure (EK-OPTICS).
The extracted features, detected intents, and structured data are input to the Multi-head Gated Recurrent Learnable Attention Unit (MGR-LAU).
arXiv Detail & Related papers (2024-06-06T20:28:05Z)
- TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression [25.180317527112372]
Normalized coordinate expression is a key factor in reducing the reliance on hand-crafted components of query-based detectors for temporal action detection (TAD).
We propose TE-TAD, a fully end-to-end temporal action detection transformer that integrates time-aligned coordinate expression.
Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors.
arXiv Detail & Related papers (2024-04-03T02:16:30Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors [37.38250825377456]
We take a different perspective on video object detection: we improve the quality of queries for Transformer-based models by aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
arXiv Detail & Related papers (2023-03-15T02:14:56Z)
- ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers [73.29057814695459]
ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene thereby reducing annotation cost.
We show performance improvement for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
arXiv Detail & Related papers (2022-09-13T00:11:16Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
arXiv Detail & Related papers (2021-04-19T17:58:48Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
- Evaluating Temporal Queries Over Video Feeds [25.04363138106074]
Temporal queries involving objects and their co-occurrences in video feeds are of interest to many applications ranging from law enforcement to security and safety.
We present an architecture consisting of three layers, namely object detection/tracking, intermediate data generation and query evaluation.
We propose two techniques,MFS and SSG, to organize all detected objects in the intermediate data generation layer.
We also introduce an algorithm called State Traversal (ST) that processes incoming frames against the SSG and efficiently prunes objects and frames unrelated to query evaluation.
arXiv Detail & Related papers (2020-03-02T14:55:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.