EdgeNAT: Transformer for Efficient Edge Detection
- URL: http://arxiv.org/abs/2408.10527v1
- Date: Tue, 20 Aug 2024 04:04:22 GMT
- Title: EdgeNAT: Transformer for Efficient Edge Detection
- Authors: Jinghuai Jie, Yan Guo, Guixing Wu, Junmin Wu, Baojian Hua
- Abstract summary: We propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder.
Experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images.
- Score: 2.34098299695111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, renowned for their powerful feature extraction capabilities, have played an increasingly prominent role in various vision tasks. In particular, recent advancements present transformers with hierarchical structures, such as the Dilated Neighborhood Attention Transformer (DiNAT), which demonstrate an outstanding ability to efficiently capture both global and local features. However, the application of transformers to edge detection has not been fully exploited. In this paper, we propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder, capable of extracting object boundaries and meaningful edges both accurately and efficiently. On the one hand, EdgeNAT captures global contextual information and detailed local cues with DiNAT; on the other hand, it enhances feature representation with a novel SCAF-MLA decoder by utilizing both inter-spatial and inter-channel relationships of feature maps. Extensive experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images. Notably, on the widely used BSDS500 dataset, our L model achieves impressive performance, with ODS and OIS F-measures of 86.0% and 87.6% for multi-scale input, and 84.9% and 86.3% for single-scale input, surpassing the current state-of-the-art EDTER by 1.2%, 1.1%, 1.7%, and 1.6%, respectively. Moreover, in terms of throughput, our approach runs at 20.87 FPS on an RTX 4090 GPU with single-scale input. The code for our method will be released soon.
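The DiNAT encoder referenced in the abstract restricts each token's attention to a sparse, dilated neighborhood rather than the full feature map. Below is a minimal, illustrative sketch of that idea on a 1D token sequence; it is not the authors' implementation, and the kernel size, dilation, single head, and omitted query/key/value projections are all simplifying assumptions (the official NATTEN/DiNAT kernels operate on 2D feature maps far more efficiently).

```python
# Hedged sketch of dilated neighborhood attention on a 1D token sequence.
# Kernel size, dilation, and the lack of learned projections are illustrative
# assumptions, not the EdgeNAT/DiNAT configuration.
import torch
import torch.nn.functional as F


def dilated_neighborhood_attention(x, kernel_size=7, dilation=2):
    """x: (batch, length, dim). Each token attends only to `kernel_size`
    neighbors spaced `dilation` positions apart, instead of to all tokens."""
    b, n, d = x.shape
    q = k = v = x  # a real block would use learned per-head projections
    half = kernel_size // 2
    outputs = []
    for i in range(n):
        # indices of the dilated neighborhood, clamped to the sequence bounds
        idx = torch.arange(i - half * dilation, i + half * dilation + 1, dilation)
        idx = idx.clamp(0, n - 1)
        k_i = k[:, idx, :]                                            # (b, kernel_size, d)
        v_i = v[:, idx, :]
        attn = (q[:, i:i + 1, :] @ k_i.transpose(1, 2)) / d ** 0.5    # (b, 1, kernel_size)
        outputs.append(F.softmax(attn, dim=-1) @ v_i)                 # (b, 1, d)
    return torch.cat(outputs, dim=1)                                  # (b, n, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 32, 64)
    print(dilated_neighborhood_attention(tokens).shape)  # torch.Size([2, 32, 64])
```

With a larger dilation the same small kernel reaches much further across the sequence, which is how DiNAT-style encoders trade between local detail and global context without paying for full self-attention.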
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Efficient Remote Sensing Segmentation With Generative Adversarial Transformer [5.728847418491545]
This paper proposes an efficient Generative Adversarial Transformer (GATrans) for achieving high-precision semantic segmentation.
The framework utilizes a Global Transformer Network (GTNet) as the generator, efficiently extracting multi-level features.
We validate the effectiveness of our approach through extensive experiments on the Vaihingen dataset, achieving an average F1 score of 90.17% and an overall accuracy of 91.92%.
arXiv Detail & Related papers (2023-10-02T15:46:59Z)
- DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation [10.379708894083217]
We propose a highly efficient multi-scale feature extraction method that decomposes the original single-step method into two steps: Region Residualization and Semantic Residualization.
We achieve an mIoU of 72.7% on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, exceeding the latest methods by 69.5 FPS in speed and 0.8% in mIoU.
arXiv Detail & Related papers (2022-12-02T13:55:41Z)
- StyleNAT: Giving Each Head a New Perspective [71.84791905122052]
We present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility.
At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information, as sketched below.
StyleNAT attains a new state-of-the-art FID of 2.046 on FFHQ-256, surpassing prior art including convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin.
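A minimal sketch of the head-partitioning idea, assuming a 1D token layout in which half of the heads attend within a dense local window and the other half within a dilated, wider-reaching window; the window sizes, dilations, and head count here are illustrative and are not StyleNAT's actual configuration.

```python
# Hedged sketch of partitioned attention heads: each head group sees a
# different neighborhood pattern (dense local window vs. dilated window).
import torch
import torch.nn.functional as F


def neighborhood_mask(n, kernel_size, dilation):
    """(n, n) boolean mask: True where key j lies in query i's dilated window."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    offset = j - i
    within = offset.abs() <= (kernel_size // 2) * dilation
    on_grid = offset % dilation == 0
    return within & on_grid


def partitioned_head_attention(x, n_heads=4):
    b, n, d = x.shape
    head_dim = d // n_heads
    q = k = v = x.view(b, n, n_heads, head_dim).transpose(1, 2)   # (b, h, n, hd)
    # first half of the heads: tight local window; second half: dilated window
    masks = [neighborhood_mask(n, 7, 1)] * (n_heads // 2) + \
            [neighborhood_mask(n, 7, 4)] * (n_heads - n_heads // 2)
    mask = torch.stack(masks)                                      # (h, n, n)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5             # (b, h, n, n)
    scores = scores.masked_fill(~mask, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                            # (b, h, n, hd)
    return out.transpose(1, 2).reshape(b, n, d)


if __name__ == "__main__":
    print(partitioned_head_attention(torch.randn(2, 32, 64)).shape)  # (2, 32, 64)
```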
arXiv Detail & Related papers (2022-11-10T18:55:48Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups; a sketch of this channel-group split follows below.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
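The sketch below illustrates only the channel-group split mentioned above: the feature map is chunked along channels, each group is refined with a cheap depth-wise convolution, and, as an assumption here, each group also receives the previous group's output before the groups are re-concatenated. The group count and kernel size are illustrative, and the transposed (channel-wise) attention part of SDTA is omitted.

```python
# Hedged sketch of a channel-group split with depth-wise convolutions.
# Group count, kernel size, and the cross-group residual are assumptions,
# not EdgeNeXt's exact SDTA encoder.
import torch
import torch.nn as nn


class ChannelGroupSplit(nn.Module):
    def __init__(self, channels=64, groups=4):
        super().__init__()
        assert channels % groups == 0
        gc = channels // groups
        self.groups = groups
        self.dw_convs = nn.ModuleList([
            nn.Conv2d(gc, gc, kernel_size=3, padding=1, groups=gc)  # depth-wise
            for _ in range(groups)
        ])

    def forward(self, x):                      # x: (b, c, h, w)
        splits = torch.chunk(x, self.groups, dim=1)
        outs, prev = [], 0
        for split, conv in zip(splits, self.dw_convs):
            prev = conv(split + prev)          # pass information between groups
            outs.append(prev)
        return torch.cat(outs, dim=1)          # back to (b, c, h, w)


if __name__ == "__main__":
    print(ChannelGroupSplit()(torch.randn(2, 64, 32, 32)).shape)  # (2, 64, 32, 32)
```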
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance on the nuScenes test set, with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
- SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [12.126413875108993]
We propose a cross-modality fusion model SwinNet for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z)
- A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification [4.061135251278187]
We develop a ConvNet named the multi-stage duplex fusion network (MSDF-Net).
MSDF-Net consists of multi-stage structures with DFblock.
Experiments are conducted on three widely-used aerial scene classification benchmarks.
arXiv Detail & Related papers (2022-03-29T09:27:53Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down, transformer-based information propagation path (see the sketch below).
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
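Purely as a reading aid, here is a generic sketch of such a top-down, two-stream propagation path: features from the two modalities are integrated scale by scale, and the integrated result is upsampled and passed to the next finer level. The concat-plus-1x1-conv fusion stands in for TransCMD's actual transformer-based cross-modal integration units, and all channel sizes are invented for the example.

```python
# Hedged sketch of a top-down two-stream (RGB + depth) decoder path.
# The simple concat + 1x1 conv fusion is a placeholder, not TransCMD's
# transformer-based integration unit; channel sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownCrossModalDecoder(nn.Module):
    def __init__(self, channels=(256, 128, 64)):            # coarse -> fine
        super().__init__()
        self.fuse = nn.ModuleList([
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels
        ])
        self.lateral = nn.ModuleList([
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=1)
            for i in range(len(channels) - 1)
        ])

    def forward(self, rgb_feats, depth_feats):              # lists, coarse -> fine
        prev = None
        for i, (r, d, fuse) in enumerate(zip(rgb_feats, depth_feats, self.fuse)):
            x = fuse(torch.cat([r, d], dim=1))               # cross-modal integration
            if prev is not None:                             # top-down propagation
                x = x + F.interpolate(self.lateral[i - 1](prev), size=x.shape[-2:])
            prev = x
        return prev                                          # finest-scale fused features


if __name__ == "__main__":
    rgb = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)]
    dep = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)]
    print(TopDownCrossModalDecoder()(rgb, dep).shape)        # (1, 64, 32, 32)
```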
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
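The cross-shaped windows of the title address this tension by letting different head groups attend within horizontal and vertical stripes, whose union forms a cross-shaped region around each token. The sketch below fixes the stripe width to a single row or column and omits the learned projections and positional encodings; both are simplifying assumptions relative to the actual CSWin design.

```python
# Hedged sketch of cross-shaped window attention: half of the channels (one
# head group) attend within rows, the other half within columns.
import torch
import torch.nn.functional as F


def stripe_attention(x):
    """Self-attention along the second-to-last dimension.
    x: (..., length, dim) -> (..., length, dim)"""
    attn = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(attn, dim=-1) @ x


def cross_shaped_attention(x):
    """x: (batch, height, width, dim); dim is split across two head groups."""
    x_h, x_v = x.chunk(2, dim=-1)                                   # two head groups
    out_h = stripe_attention(x_h)                                   # within each row
    out_v = stripe_attention(x_v.transpose(1, 2)).transpose(1, 2)   # within each column
    return torch.cat([out_h, out_v], dim=-1)


if __name__ == "__main__":
    print(cross_shaped_attention(torch.randn(2, 16, 16, 64)).shape)  # (2, 16, 16, 64)
```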
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.