MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
- URL: http://arxiv.org/abs/2503.01463v1
- Date: Mon, 03 Mar 2025 12:19:06 GMT
- Title: MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
- Authors: Zhixiong Nan, Xianghong Li, Jifeng Dai, Tao Xiang,
- Abstract summary: We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism.<n>Our MI based model, MI-DETR, outperforms all existing DETR-like models on COCO benchmark.<n>A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
- Score: 67.56918651825056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Based on analyzing the character of cascaded decoder architecture commonly adopted in existing DETR-like models, this paper proposes a new decoder architecture. The cascaded decoder architecture constrains object queries to update in the cascaded direction, only enabling object queries to learn relatively-limited information from image features. However, the challenges for object detection in natural scenes (e.g., extremely-small, heavily-occluded, and confusingly mixed with the background) require an object detection model to fully utilize image features, which motivates us to propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism. MI enables object queries to learn more comprehensive information, and our MI based model, MI-DETR, outperforms all existing DETR-like models on COCO benchmark under different backbones and training epochs, achieving +2.3 AP and +0.6 AP improvements compared to the most representative model DINO and SOTA model Relation-DETR under ResNet-50 backbone. In addition, a series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
Related papers
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas.
We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z) - YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary [12.39040757106137]
We introduce an innovative Retriever-Dictionary (RD) module to address this issue.<n>This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset.<n>Experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection.
arXiv Detail & Related papers (2024-10-20T09:38:58Z) - Transformer Network for Multi-Person Tracking and Re-Identification in
Unconstrained Environment [0.6798775532273751]
Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics.
We put forward an integrated MOT method that marries object detection and identity linkage within a singular, end-to-end trainable framework.
Our system leverages a robust memory-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator.
arXiv Detail & Related papers (2023-12-19T08:15:22Z) - Contrastive Learning for Multi-Object Tracking with Transformers [79.61791059432558]
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
arXiv Detail & Related papers (2023-11-14T10:07:52Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR)
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z) - Unsupervised Multi-object Segmentation Using Attention and Soft-argmax [0.6853165736531939]
We introduce a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation.
We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks and provide examples of applications to real-world traffic videos.
arXiv Detail & Related papers (2022-05-26T10:58:48Z) - Triple-level Model Inferred Collaborative Network Architecture for Video
Deraining [43.06607185181434]
We develop a model-guided triple-level optimization framework to deduce network architecture with cooperating optimization and auto-searching mechanism.
Our model shows significant improvements in fidelity and temporal consistency over the state-of-the-art works.
arXiv Detail & Related papers (2021-11-08T13:09:00Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z) - Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions.
We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks.
arXiv Detail & Related papers (2020-03-10T19:34:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.