DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model
- URL: http://arxiv.org/abs/2410.16707v1
- Date: Tue, 22 Oct 2024 05:22:49 GMT
- Title: DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model
- Authors: Zhixiong Nan, Xianghong Li, Tao Xiang, Jifeng Dai,
- Abstract summary: The performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of MaskDINO.
This paper proposes DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance.
DI-MaskDINO outperforms existing joint object detection and instance segmentation models on COCO and BDD100K benchmarks.
- Score: 67.56918651825056
- License:
- Abstract: This paper is motivated by an interesting phenomenon: the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation). This phenomenon inspires us to think about a question: will the performance imbalance at the beginning layer of transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre-experiments, which validate the negative impact of detection-segmentation imbalance issue on the model performance. To address this issue, this paper proposes DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is implemented by configuring our proposed De-Imbalance (DI) module and Balance-Aware Tokens Optimization (BATO) module to MaskDINO. DI is responsible for generating balance-aware query, and BATO uses the balance-aware query to guide the optimization of the initial feature tokens. The balance-aware query and optimized feature tokens are respectively taken as the Query and Key&Value of transformer decoder to perform joint object detection and instance segmentation. DI-MaskDINO outperforms existing joint object detection and instance segmentation models on COCO and BDD100K benchmarks, achieving +1.2 $AP^{box}$ and +0.9 $AP^{mask}$ improvements compared to SOTA joint detection and segmentation model MaskDINO. In addition, DI-MaskDINO also obtains +1.0 $AP^{box}$ improvement compared to SOTA object detection model DINO and +3.0 $AP^{mask}$ improvement compared to SOTA segmentation model Mask2Former.
Related papers
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $mathbf120k$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $mathbf10$ distinct dimensions and $mathbf27$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection [65.84604846389624]
We propose PointOBB-v3, a stronger single point-supervised OOD framework.
It generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm.
Our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods.
arXiv Detail & Related papers (2025-01-23T18:18:15Z) - Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation [14.931551206723041]
Weakly Supervised Semantic (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model.
We propose ORANDNet, an advanced ensemble approach tailored for WSSS.
arXiv Detail & Related papers (2024-06-28T03:58:02Z) - MOD-CL: Multi-label Object Detection with Constrained Loss [3.92610460921618]
In this paper, we use $mathrmMOD_YOLO$, a multi-label object detection model built upon the state-of-the-art object detection model YOLOv8.
In Task 1, we introduce the Corrector Model and Blender Model, two new models that follow after the object detection process, aiming to generate a more constrained output.
For Task 2, constrained losses have been incorporated into the $mathrmMOD_YOLO$ architecture using Product T-Norm.
arXiv Detail & Related papers (2024-01-31T23:13:20Z) - Decoupled DETR For Few-shot Object Detection [4.520231308678286]
We improve the FSOD model to address the severe issue of sample imbalance and weak feature propagation.
We build a unified decoder module that could dynamically fuse the decoder layers as the output feature.
Our results indicate that our proposed module could achieve stable improvements of 5% to 10% in both fine-tuning and meta-learning paradigms.
arXiv Detail & Related papers (2023-11-20T07:10:39Z) - MoPA: Multi-Modal Prior Aided Domain Adaptation for 3D Semantic
Segmentation [38.42077782990957]
Multi-modal unsupervised domain adaptation (MM-UDA) is a practical solution to embed semantic understanding in autonomous systems without expensive point-wise annotations.
Previous MM-UDA methods suffer from significant class-imbalanced performance, restricting their adoption in real applications.
We propose Multi-modal Prior Aided (MoPA) domain adaptation to improve the performance of rare objects.
arXiv Detail & Related papers (2023-09-21T07:30:21Z) - D2Q-DETR: Decoupling and Dynamic Queries for Oriented Object Detection
with Transformers [14.488821968433834]
We propose an end-to-end framework for oriented object detection.
Our framework is based on DETR, with the box regression head replaced with a points prediction head.
Experiments on the largest and challenging DOTA-v1.0 and DOTA-v1.5 datasets show that D2Q-DETR outperforms existing NMS-based and NMS-free oriented object detection methods.
arXiv Detail & Related papers (2023-03-01T14:36:19Z) - GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector [156.43671738038657]
We present a novel end-to-end group collaborative learning network, termed GCoNet+.
GCoNet+ can effectively and efficiently identify co-salient objects in natural scenes.
arXiv Detail & Related papers (2022-05-30T23:49:19Z) - Online Multi-Object Tracking and Segmentation with GMPHD Filter and
Mask-based Affinity Fusion [79.87371506464454]
We propose a fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input.
The proposed method is based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association (HDA), and a mask-based affinity fusion (MAF) model.
In the experiments on the two popular MOTS datasets, the key modules show some improvements.
arXiv Detail & Related papers (2020-08-31T21:06:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.