Related papers: DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

URL: http://arxiv.org/abs/2410.16707v1
Date: Tue, 22 Oct 2024 05:22:49 GMT
Title: DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model
Authors: Zhixiong Nan, Xianghong Li, Tao Xiang, Jifeng Dai
Abstract summary: In the intermediate results from the beginning transformer decoder layer of MaskDINO, the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance). This paper proposes the DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO outperforms existing joint object detection and instance segmentation models on the COCO and BDD100K benchmarks.
Score: 67.56918651825056
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper is motivated by an interesting phenomenon: in the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation), the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance). This phenomenon raises a question: will the performance imbalance at the beginning layer of the transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre-experiments, which validate the negative impact of the detection-segmentation imbalance on model performance. To address this issue, this paper proposes the DI-MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is implemented by adding our proposed De-Imbalance (DI) module and Balance-Aware Tokens Optimization (BATO) module to MaskDINO. DI is responsible for generating the balance-aware query, and BATO uses the balance-aware query to guide the optimization of the initial feature tokens. The balance-aware query and the optimized feature tokens are taken as the Query and the Key&Value, respectively, of the transformer decoder to perform joint object detection and instance segmentation. DI-MaskDINO outperforms existing joint object detection and instance segmentation models on the COCO and BDD100K benchmarks, achieving +1.2 $AP^{box}$ and +0.9 $AP^{mask}$ improvements compared to the SOTA joint detection and segmentation model MaskDINO. In addition, DI-MaskDINO obtains a +1.0 $AP^{box}$ improvement compared to the SOTA object detection model DINO and a +3.0 $AP^{mask}$ improvement compared to the SOTA segmentation model Mask2Former.
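The abstract says the balance-aware query and the optimized feature tokens are fed to the transformer decoder as Query and Key&Value. The sketch below illustrates only that generic cross-attention step. It is not the paper's DI or BATO module, whose internals the abstract does not describe. The function names, the toy vectors, and the omission of learned projections and multiple heads are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    queries: list of d-dimensional vectors, a stand-in for the
             balance-aware query (assumption, not the paper's code).
    tokens:  list of d-dimensional vectors, a stand-in for the optimized
             feature tokens, used here as both Key and Value.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every token (Key role).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens (Value role).
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        outputs.append(out)
    return outputs


# Toy inputs: two queries attending over three feature tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = cross_attention(queries, tokens)
```

Each row of `updated` is a weighted average of the tokens, with the weights set by how closely each token matches the corresponding query. In a real decoder layer this step would be wrapped in learned projections, residual connections, and prediction heads for boxes and masks.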

Related papers

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
Enhancing DNA Foundation Models to Address Masking Inefficiencies [18.54660252939211]
We propose a modified encoder-decoder architecture based on the masked autoencoder framework. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes.
arXiv Detail & Related papers (2025-02-25T17:56:25Z)
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs. We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms. Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection [65.84604846389624]
We propose PointOBB-v3, a stronger single point-supervised OOD framework. It generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. Our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods.
arXiv Detail & Related papers (2025-01-23T18:18:15Z)
Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation [14.931551206723041]
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. We propose ORANDNet, an advanced ensemble approach tailored for WSSS.
arXiv Detail & Related papers (2024-06-28T03:58:02Z)
MOD-CL: Multi-label Object Detection with Constrained Loss [3.92610460921618]
In this paper, we use $\mathrm{MOD_{YOLO}}$, a multi-label object detection model built upon the state-of-the-art object detection model YOLOv8. In Task 1, we introduce the Corrector Model and Blender Model, two new models that follow the object detection process, aiming to generate a more constrained output. For Task 2, constrained losses have been incorporated into the $\mathrm{MOD_{YOLO}}$ architecture using Product T-Norm.
arXiv Detail & Related papers (2024-01-31T23:13:20Z)
Decoupled DETR For Few-shot Object Detection [4.520231308678286]
We improve the FSOD model to address the severe issue of sample imbalance and weak feature propagation. We build a unified decoder module that could dynamically fuse the decoder layers as the output feature. Our results indicate that our proposed module could achieve stable improvements of 5% to 10% in both fine-tuning and meta-learning paradigms.
arXiv Detail & Related papers (2023-11-20T07:10:39Z)
MoPA: Multi-Modal Prior Aided Domain Adaptation for 3D Semantic Segmentation [38.42077782990957]
Multi-modal unsupervised domain adaptation (MM-UDA) is a practical solution to embed semantic understanding in autonomous systems without expensive point-wise annotations. Previous MM-UDA methods suffer from significant class-imbalanced performance, restricting their adoption in real applications. We propose Multi-modal Prior Aided (MoPA) domain adaptation to improve the performance of rare objects.
arXiv Detail & Related papers (2023-09-21T07:30:21Z)
ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection [55.291579862817656]
Existing oriented object detection methods commonly use the metric AP$_{50}$ to measure the performance of the model. We argue that AP$_{50}$ is inherently unsuitable for oriented object detection due to its large tolerance in angle deviation. We propose an Aspect Ratio Sensitive Oriented Object Detector with Transformer, termed ARS-DETR, which exhibits competitive performance.
arXiv Detail & Related papers (2023-03-09T02:20:56Z)
D2Q-DETR: Decoupling and Dynamic Queries for Oriented Object Detection with Transformers [14.488821968433834]
We propose an end-to-end framework for oriented object detection. Our framework is based on DETR, with the box regression head replaced with a points prediction head. Experiments on the largest and challenging DOTA-v1.0 and DOTA-v1.5 datasets show that D2Q-DETR outperforms existing NMS-based and NMS-free oriented object detection methods.
arXiv Detail & Related papers (2023-03-01T14:36:19Z)
GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector [156.43671738038657]
We present a novel end-to-end group collaborative learning network, termed GCoNet+. GCoNet+ can effectively and efficiently identify co-salient objects in natural scenes.
arXiv Detail & Related papers (2022-05-30T23:49:19Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. In order to achieve a better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
Online Multi-Object Tracking and Segmentation with GMPHD Filter and Mask-based Affinity Fusion [79.87371506464454]
We propose a fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input. The proposed method is based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association (HDA), and a mask-based affinity fusion (MAF) model. In the experiments on the two popular MOTS datasets, the key modules show some improvements.
arXiv Detail & Related papers (2020-08-31T21:06:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.