FAQ: Feature Aggregated Queries for Transformer-based Video Object
Detectors
- URL: http://arxiv.org/abs/2303.08319v2
- Date: Mon, 20 Mar 2023 16:54:34 GMT
- Title: FAQ: Feature Aggregated Queries for Transformer-based Video Object
Detectors
- Authors: Yiming Cui, Linjie Yang
- Abstract summary: We take a different perspective on video object detection. Specifically, we improve the quality of the queries for Transformer-based models by aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
- Score: 37.38250825377456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object detection needs to solve feature degradation situations that
rarely happen in the image domain. One solution is to use the temporal
information and fuse the features from neighboring frames. With
Transformer-based object detectors achieving better performance on image-domain
tasks, recent works began to extend those methods to video object
detection. However, those existing Transformer-based video object detectors
still follow the same pipeline as those used for classical object detectors,
like enhancing the object feature representations by aggregation. In this work,
we take a different perspective on video object detection. Specifically, we
improve the quality of the queries for Transformer-based models by
aggregation. To achieve this goal, we first propose a vanilla query aggregation
module that computes a weighted average of the queries according to the features of the
neighboring frames. Then, we extend the vanilla module to a more practical
version, which generates and aggregates queries according to the features of
the input frames. Extensive experimental results validate the effectiveness of
our proposed methods: On the challenging ImageNet VID benchmark, when
integrated with our proposed modules, the current state-of-the-art
Transformer-based object detectors can be improved by more than 2.4% on mAP and
4.2% on AP50.
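The vanilla query aggregation idea described above can be illustrated with a minimal, framework-free sketch: score each neighboring frame by the similarity of its features to the key frame's features, turn the scores into softmax weights, and take the weighted average of the per-frame queries. This is only an illustration of the mechanism under simplifying assumptions (one query vector per frame, cosine similarity as the scoring function); the function names and shapes are hypothetical and do not reproduce the paper's implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def aggregate_queries(queries, frame_feats, key_idx=0):
    """Weighted-average object queries across frames.

    queries: one query vector per frame (real detectors use hundreds
             of queries per frame; a single vector keeps the sketch short).
    frame_feats: per-frame feature vectors used to score how relevant
                 each neighboring frame is to the key frame.
    """
    # Similarity of every frame's features to the key frame's features.
    sims = [cosine(frame_feats[key_idx], f) for f in frame_feats]
    # Softmax over similarities gives the aggregation weights.
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the queries, dimension by dimension.
    dim = len(queries[0])
    return [sum(w * q[d] for w, q in zip(weights, queries)) for d in range(dim)]
```

When all frames have identical features, the softmax weights are uniform and the result is the plain mean of the queries; dissimilar neighbors receive smaller weights, so their queries contribute less to the aggregated query.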
Related papers
- Learning Dynamic Query Combinations for Transformer-based Object
Detection and Segmentation [37.24532930188581]
Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network.
We empirically find that random convex combinations of the learned queries are still good for the corresponding models.
We propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image.
arXiv Detail & Related papers (2023-07-23T06:26:27Z)
- ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised
Video Object Segmentation [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16.
We also employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects.
arXiv Detail & Related papers (2023-07-05T03:43:15Z)
- DFA: Dynamic Feature Aggregation for Efficient Video Object Detection [15.897168900583774]
We propose a vanilla dynamic aggregation module that adaptively selects the frames for feature enhancement.
We extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version.
On the ImageNet VID benchmark, integrated with our proposed methods, FGFA and SELSA can improve the inference speed by 31% and 76% respectively.
arXiv Detail & Related papers (2022-10-02T17:54:15Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal
Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- End-to-End Video Object Detection with Spatial-Temporal Transformers [33.40462554784311]
We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
arXiv Detail & Related papers (2021-05-23T11:44:22Z)
- Ensembling object detectors for image and video data analysis [98.26061123111647]
We propose a method for ensembling the outputs of multiple object detectors for improving detection performance and precision of bounding boxes on image data.
We extend it to video data by proposing a two-stage tracking-based scheme for detection refinement.
arXiv Detail & Related papers (2021-02-09T12:38:16Z)
- LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)
- Plug & Play Convolutional Regression Tracker for Video Object Detection [37.47222104272429]
Video object detection aims to simultaneously localize the bounding boxes of objects and identify their classes in a given video.
One challenge for video object detection is to consistently detect all objects across the whole video.
We propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task.
arXiv Detail & Related papers (2020-03-02T15:57:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.