DeblurDiNAT: A Compact Model with Exceptional Generalization and Visual Fidelity on Unseen Domains
- URL: http://arxiv.org/abs/2403.13163v5
- Date: Wed, 15 Jan 2025 18:45:15 GMT
- Title: DeblurDiNAT: A Compact Model with Exceptional Generalization and Visual Fidelity on Unseen Domains
- Authors: Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu,
- Abstract summary: DeDiNAT is a deblurring Transformer based on Dilated Neighborhood Attention.<n>A cross-channel learner aids the Transformer block to understand the short-range relationships between adjacent channels.<n>Compared to state-of-the-art models, our compact DeDiNAT demonstrates superior generalization capabilities and achieves remarkable performance in perceptual metrics.
- Score: 1.5124439914522694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent deblurring networks have effectively restored clear images from the blurred ones. However, they often struggle with generalization to unknown domains. Moreover, these models typically focus on distortion metrics such as PSNR and SSIM, neglecting the critical aspect of metrics aligned with human perception. To address these limitations, we propose DeblurDiNAT, a deblurring Transformer based on Dilated Neighborhood Attention. First, DeblurDiNAT employs an alternating dilation factor paradigm to capture both local and global blurred patterns, enhancing generalization and perceptual clarity. Second, a local cross-channel learner aids the Transformer block to understand the short-range relationships between adjacent channels. Additionally, we present a linear feed-forward network with a simple while effective design. Finally, a dual-stage feature fusion module is introduced as an alternative to the existing approach, which efficiently process multi-scale visual information across network levels. Compared to state-of-the-art models, our compact DeblurDiNAT demonstrates superior generalization capabilities and achieves remarkable performance in perceptual metrics, while maintaining a favorable model size.
Related papers
- Cross Paradigm Representation and Alignment Transformer for Image Deraining [40.66823807648992]
We propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer)
Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms to aid image reconstruction.
We use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA)
arXiv Detail & Related papers (2025-04-23T06:44:46Z) - Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Priors Distillation (RPD) method to extract priors from the well-trained transformers on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image (DIS) has recently emerged towards high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by it, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet)
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Dual-Stream Attention Transformers for Sewer Defect Classification [2.5499055723658097]
We propose a dual-stream vision transformer architecture that processes RGB and optical flow inputs for efficient sewer defect classification.
Our key idea is to use self-attention regularization to harness the complementary strengths of the RGB and motion streams.
By leveraging motion cues through a self-attention regularizer, we align and enhance RGB attention maps, enabling the network to concentrate on pertinent input regions.
arXiv Detail & Related papers (2023-11-07T02:31:51Z) - Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z) - Multi-scale Alternated Attention Transformer for Generalized Stereo
Matching [7.493797166406228]
We present a simple but highly effective network called Alternated Attention U-shaped Transformer (AAUformer) to balance the impact of epipolar line in dual and single view.
Compared to other models, our model has several main designs.
We performed a series of both comparative studies and ablation studies on several mainstream stereo matching datasets.
arXiv Detail & Related papers (2023-08-06T08:22:39Z) - Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based
Transformer Network for Remote Sensing Image Super-Resolution [13.894645293832044]
Transformer-based models have shown competitive performance in remote sensing image super-resolution (RSISR)
We propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR.
Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features cross-stages.
arXiv Detail & Related papers (2023-07-06T13:19:06Z) - Local-Global Transformer Enhanced Unfolding Network for Pan-sharpening [13.593522290577512]
Pan-sharpening aims to increase the spatial resolution of the low-resolution multispectral (LrMS) image with the guidance of the corresponding panchromatic (PAN) image.
Although deep learning (DL)-based pan-sharpening methods have achieved promising performance, most of them have a two-fold deficiency.
arXiv Detail & Related papers (2023-04-28T03:34:36Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape
Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT)
Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection [6.385624548310884]
We propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem.
Unlike previous multi-modal transformers that directly connecting all patches from two modalities, we explore the cross-modal complementarity hierarchically.
We present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a consistency-complementarity module to disentangle the multi-modal integration path.
arXiv Detail & Related papers (2023-02-16T03:23:23Z) - Blur Interpolation Transformer for Real-World Motion from Blur [52.10523711510876]
We propose a encoded blur transformer (BiT) to unravel the underlying temporal correlation in blur.
Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies.
In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs.
arXiv Detail & Related papers (2022-11-21T13:10:10Z) - Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly.
A simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model in this work.
It is shown to achieve a state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - Transformer-Guided Convolutional Neural Network for Cross-View
Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z) - TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping RF for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.