Investigating the Robustness and Properties of Detection Transformers
(DETR) Toward Difficult Images
- URL: http://arxiv.org/abs/2310.08772v1
- Date: Thu, 12 Oct 2023 23:38:52 GMT
- Title: Investigating the Robustness and Properties of Detection Transformers
(DETR) Toward Difficult Images
- Authors: Zhao Ning Zou, Yuhang Zhang, Robert Wijaya
- Abstract summary: Transformer-based object detectors (DETR) have shown strong performance across machine vision tasks.
The critical issue to be addressed is how this model architecture can handle different image nuisances.
We studied this issue by measuring the performance of DETR with different experiments and benchmarking the network.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based object detectors (DETR) have shown strong performance
across machine vision tasks, most notably in object detection. This detector is
built on a self-attention mechanism and a transformer encoder-decoder
architecture that captures the global context of the image. The critical issue
to be addressed is how this model architecture handles different image
nuisances, such as occlusion and adversarial perturbations. We studied this
issue by measuring the performance of DETR across several experiments and
benchmarking the network against convolutional neural network (CNN) based
detectors such as YOLO and Faster R-CNN. We found that DETR resists the
information loss caused by occlusion well. However, adversarial stickers placed
on an image force the network to produce a new, unnecessary set of keys,
queries, and values, which in most cases misdirects the network. DETR also
performed worse than YOLOv5 on the image corruption benchmark. Furthermore, we
found that DETR depends heavily on the main query when making a prediction,
which leads to imbalanced contributions among queries, since the main query
receives most of the gradient flow.
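The imbalanced gradient flow across object queries can be probed directly. Below is a minimal sketch, not the authors' code, assuming the public facebookresearch/detr torch.hub checkpoint (where the learned queries live in `model.query_embed`) and a plain logit-sum objective as a stand-in for the real Hungarian-matched detection loss:

```python
import torch

# Load the public DETR-ResNet50 checkpoint from torch.hub.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()  # parameters still require grad, so backward works

image = torch.randn(1, 3, 800, 800)  # stand-in for a real preprocessed image
outputs = model(image)               # dict with 'pred_logits' and 'pred_boxes'

# Proxy scalar objective: a real analysis would backprop the Hungarian-matched
# detection loss instead of a plain logit sum.
outputs['pred_logits'].sum().backward()

# One gradient norm per learned object query (100 by default); a dominant entry
# supports the "main query receives most of the gradient flow" observation.
grad_norms = model.query_embed.weight.grad.norm(dim=1)
main = grad_norms.argmax().item()
share = (grad_norms[main] / grad_norms.sum()).item()
print(f"query {main} holds {share:.1%} of the gradient mass")
```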
Related papers
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, the network faces the challenge of distinguishing the positive query from other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images
This paper takes an important step in bridging the performance gap between DETR and R-CNN for graphical object detection.
We modify the object queries in different ways: using points, using anchor boxes, and adding positive and negative noise to the anchors to boost performance (see the sketch below).
We evaluate our approach on four graphical datasets: PubTables, TableBank, NTable, and PubLayNet.
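A minimal sketch of the query-noising idea as summarized above; the function name and noise scales are illustrative assumptions, not the paper's implementation:

```python
import torch

def noised_anchor_queries(gt_boxes: torch.Tensor, pos_scale: float = 0.1,
                          neg_scale: float = 0.4):
    """gt_boxes: (N, 4) boxes in (cx, cy, w, h); returns noisy training copies."""
    def jitter(scale: float) -> torch.Tensor:
        # Uniform noise in [-scale, scale], proportional to each box's w and h.
        return (torch.rand_like(gt_boxes) * 2 - 1) * scale * gt_boxes[:, 2:].repeat(1, 2)

    positives = gt_boxes + jitter(pos_scale)  # mild noise: should still match objects
    negatives = gt_boxes + jitter(neg_scale)  # strong noise: trained as background
    return positives, negatives
```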
arXiv Detail & Related papers (2023-06-23T14:46:03Z)
- Image Deblurring by Exploring In-depth Properties of Transformer
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing performance on quantitative metrics.
By comparing transformer features between the recovered image and the target, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One variant regards the features as vectors and computes the discrepancy between the representations of the recovered and target images in Euclidean space.
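A minimal sketch of that Euclidean feature discrepancy, assuming torchvision's ViT-B/16 and its 'encoder.ln' graph node as the feature tap (both are our assumptions; the paper's backbone and layer choice may differ):

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Frozen ViT used purely as a feature critic; 'encoder.ln' is the final
# LayerNorm in torchvision's ViT-B/16.
vit = create_feature_extractor(
    vit_b_16(weights=ViT_B_16_Weights.DEFAULT),
    return_nodes={'encoder.ln': 'feats'},
).eval()
for p in vit.parameters():
    p.requires_grad_(False)  # gradients still flow to the input images

def vit_feature_loss(recovered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Euclidean discrepancy between ViT token features (inputs: B x 3 x 224 x 224)."""
    f_rec = vit(recovered)['feats']  # (B, tokens, dim)
    f_tgt = vit(target)['feats']
    return (f_rec - f_tgt).norm(dim=-1).mean()
```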
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- Adversarially-Aware Robust Object Detector
We propose a Robust Detector (RobustDet) based on adversarially-aware convolution to disentangle gradients for model learning on clean and adversarial images.
Our model effectively disentangles gradients and significantly enhances detection robustness while maintaining detection accuracy on clean images.
arXiv Detail & Related papers (2022-07-13T13:59:59Z)
- Pyramid Grafting Network for One-Stage High Resolution Saliency Detection
We propose a one-stage framework called Pyramid Grafting Network (PGNet) to extract features from different resolution images independently.
An attention-based Cross-Model Grafting Module (CMGM) is proposed to enable the CNN branch to combine broken detail information more holistically.
We contribute a new Ultra-High-Resolution Saliency Detection dataset UHRSD, containing 5,920 images at 4K-8K resolutions.
arXiv Detail & Related papers (2022-04-11T12:22:21Z)
- Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence
We propose a transformer architecture with a mitigatory self-attention mechanism.
Miti-DETR carries the input of each attention layer through to that layer's output so that the "non-attention" information also participates in attention propagation.
Miti-DETR significantly improves average detection precision and convergence speed over existing DETR-based models.
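Read literally, this summary describes carrying each layer's raw input forward next to its attention output. A generic sketch of that reading, an assumption on our part rather than the authors' exact formulation:

```python
import torch
import torch.nn as nn

class ReservedAttention(nn.Module):
    """Self-attention whose input is reserved alongside its output."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(x, x, x, need_weights=False)
        # The raw ("non-attention") input keeps propagating with the output.
        return self.norm(x + attended)
```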
arXiv Detail & Related papers (2021-12-26T03:23:59Z)
- Understanding Robustness of Transformers for Image Classification
Vision Transformer (ViT) has surpassed ResNets for image classification.
Details of the Transformer architecture lead one to wonder whether these networks are as robust as their CNN counterparts.
We find that ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations.
arXiv Detail & Related papers (2021-03-26T16:47:55Z)
- D-Unet: A Dual-encoder U-Net for Image Splicing Forgery Detection and Localization
Image splicing forgery detection is a global binary classification task that distinguishes tampered from non-tampered regions using image fingerprints.
We propose a novel network called dual-encoder U-Net (D-Unet) for image splicing forgery detection, which employs an unfixed encoder and a fixed encoder.
In an experimental comparison study of D-Unet and state-of-the-art methods, D-Unet outperformed the other methods in image-level and pixel-level detection.
arXiv Detail & Related papers (2020-12-03T10:54:02Z)
- Rethinking Transformer-based Set Prediction for Object Detection
Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.
arXiv Detail & Related papers (2020-11-21T21:59:42Z)
- End-to-End Object Detection with Transformers
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
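The bipartite matching at the heart of that loss can be condensed as follows; this sketch keeps only an L1 box cost, whereas DETR's full matcher also includes class-probability and generalized-IoU terms:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor):
    """pred_boxes: (Q, 4), gt_boxes: (G, 4); one-to-one matching by L1 cost."""
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G) pairwise box cost
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx  # queries left unmatched are supervised as "no object"
```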
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
- Real-Time Detectors for Digital and Physical Adversarial Inputs to Perception Systems
Deep neural network (DNN) models have proven to be vulnerable to adversarial digital and physical attacks.
We propose a novel attack- and dataset-agnostic and real-time detector for both types of adversarial inputs to DNN-based perception systems.
In particular, the proposed detector relies on the observation that adversarial images are sensitive to certain label-invariant transformations.
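A minimal sketch of that observation for a classifier; the translation transform, vote count, and majority threshold are illustrative assumptions, not the paper's detector:

```python
import torch

def is_suspicious(model: torch.nn.Module, image: torch.Tensor,
                  max_shift: int = 4, votes: int = 8) -> bool:
    """Flag inputs whose top-1 label flips under small label-invariant shifts."""
    with torch.no_grad():
        base = model(image.unsqueeze(0)).argmax(dim=1).item()
        flips = 0
        for _ in range(votes):
            dy, dx = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
            # Small circular translation: label-invariant for natural images.
            shifted = torch.roll(image, shifts=(dy, dx), dims=(1, 2))
            if model(shifted.unsqueeze(0)).argmax(dim=1).item() != base:
                flips += 1
    return flips > votes // 2
```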
arXiv Detail & Related papers (2020-02-23T00:03:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.