DFA: Dynamic Feature Aggregation for Efficient Video Object Detection
- URL: http://arxiv.org/abs/2210.00588v1
- Date: Sun, 2 Oct 2022 17:54:15 GMT
- Title: DFA: Dynamic Feature Aggregation for Efficient Video Object Detection
- Authors: Yiming Cui
- Abstract summary: We propose a vanilla dynamic aggregation module that adaptively selects the frames for feature enhancement.
We extend the vanilla dynamic aggregation module to a more effective and reconfigurable deformable version.
On the ImageNet VID benchmark, integrated with our proposed methods, FGFA and SELSA can improve the inference speed by 31% and 76% respectively.
- Score: 15.897168900583774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object detection is a fundamental yet challenging task in computer
vision. One practical solution is to take advantage of temporal information
from the video and apply feature aggregation to enhance the object features in
each frame. Though effective, existing methods always suffer from low
inference speeds because they use a fixed number of frames for feature
aggregation regardless of the input frame. Therefore, this paper aims to
improve the inference speed of the current feature aggregation-based video
object detectors while maintaining their performance. To achieve this goal, we
propose a vanilla dynamic aggregation module that adaptively selects the frames
for feature enhancement. Then, we extend the vanilla dynamic aggregation module
to a more effective and reconfigurable deformable version. Finally, we
introduce an inplace distillation loss to improve the representations of objects
aggregated with fewer frames. Extensive experimental results validate the
effectiveness and efficiency of our proposed methods: On the ImageNet VID
benchmark, integrated with our proposed methods, FGFA and SELSA can improve the
inference speed by 31% and 76% respectively while getting comparable
performance on accuracy.
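The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how frames are selected, how features are weighted, or what form the inplace distillation loss takes, so everything below is an assumption made for illustration: `choose_num_frames` is a hypothetical policy that maps a difficulty score to a frame budget, `aggregate` uses softmax weights over cosine similarity, and `inplace_distillation_loss` is a plain mean squared error between a few-frame result and a full-frame result. It is not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def choose_num_frames(difficulty, min_frames=1, max_frames=8):
    """Hypothetical policy: map a difficulty score in [0, 1] to a frame budget."""
    d = min(max(difficulty, 0.0), 1.0)
    return min_frames + round(d * (max_frames - min_frames))


def aggregate(current, supports, k):
    """Fuse the current feature with its k most similar support features.

    Weights are a softmax over cosine similarity to the current frame
    (an assumed weighting, chosen only to keep the sketch concrete).
    """
    ranked = sorted(supports, key=lambda s: cosine(current, s), reverse=True)
    frames = [current] + ranked[:k]
    sims = [cosine(current, f) for f in frames]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * f[i] for w, f in zip(weights, frames))
        for i in range(len(current))
    ]


def inplace_distillation_loss(student, teacher):
    """Assumed form: MSE between few-frame (student) and full-frame (teacher) features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


if __name__ == "__main__":
    current = [1.0, 0.0]
    supports = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
    k = choose_num_frames(difficulty=0.0, min_frames=1, max_frames=3)
    student = aggregate(current, supports, k)               # few frames
    teacher = aggregate(current, supports, len(supports))   # all frames
    print(k, inplace_distillation_loss(student, teacher))
```

Under these assumptions, an easy frame gets a small budget and therefore a cheaper aggregation step, while the loss pulls the few-frame feature toward the full-frame one during training.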
Related papers
- VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation [4.33608942673382]
We propose an efficient adaptive method, which we call VADet, for variable aggregation.
VADet performs aggregation per object, with the number of frames determined by an object's observed properties, such as speed and point density.
To demonstrate its benefits, we apply VADet to three popular single-stage detectors and achieve state-of-the-art performance on a dataset.
arXiv Detail & Related papers (2024-11-20T10:36:41Z)
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must cope with high across-frame variation in object appearance and with diverse deterioration in some frames.
Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs.
This study presents a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z)
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z)
- Identity-Consistent Aggregation for Video Object Detection [21.295859014601334]
In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame.
We propose ClipVID, a VID model equipped with Identity-Consistent Aggregation layers specifically designed for mining fine-grained and identity-consistent temporal contexts.
Experiments demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.
arXiv Detail & Related papers (2023-08-15T12:30:22Z)
- FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors [37.38250825377456]
We take a different perspective on video object detection. In detail, we improve the qualities of queries for the Transformer-based models by aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
arXiv Detail & Related papers (2023-03-15T02:14:56Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%, respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)