Dual-Stream Attention Transformers for Sewer Defect Classification
- URL: http://arxiv.org/abs/2311.16145v1
- Date: Tue, 7 Nov 2023 02:31:51 GMT
- Title: Dual-Stream Attention Transformers for Sewer Defect Classification
- Authors: Abdullah Al Redwan Newaz, Mahdi Abdelguerfi, Kendall N. Niles, and Joe Tom
- Abstract summary: We propose a dual-stream vision transformer architecture that processes RGB and optical flow inputs for efficient sewer defect classification.
Our key idea is to use self-attention regularization to harness the complementary strengths of the RGB and motion streams.
By leveraging motion cues through a self-attention regularizer, we align and enhance RGB attention maps, enabling the network to concentrate on pertinent input regions.
- Score: 2.5499055723658097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a dual-stream multi-scale hybrid vision transformer (DS-MSHViT)
architecture that processes RGB and optical flow inputs for efficient sewer
defect classification. Unlike existing methods that combine the predictions of
two separate networks trained on each modality, we jointly train a single
network with two branches for RGB and motion. Our key idea is to use
self-attention regularization to harness the complementary strengths of the RGB
and motion streams. The motion stream alone struggles to generate accurate
attention maps, as motion images lack the rich visual features present in RGB
images. To address this, we introduce an attention consistency loss between
the dual streams. By leveraging motion cues through a self-attention
regularizer, we align and enhance RGB attention maps, enabling the network to
concentrate on pertinent input regions. We evaluate our method on a public
dataset and cross-validate its performance on a novel dataset. Our
method outperforms existing models that utilize either convolutional neural
networks (CNNs) or multi-scale hybrid vision transformers (MSHViTs) without
employing attention regularization between the two streams.
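A minimal sketch may make the joint objective concrete. The abstract does not specify the loss form, so the MSE between head-averaged self-attention maps and the lambda_attn weight below are illustrative assumptions rather than the paper's formulation:

```python
# Hypothetical sketch of the dual-stream training objective. The MSE form,
# head averaging, and lambda_attn weight are assumptions for illustration;
# the paper's exact regularizer may differ.
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_rgb: torch.Tensor,
                               attn_motion: torch.Tensor) -> torch.Tensor:
    # attn_*: (batch, heads, tokens, tokens) self-attention maps taken from
    # corresponding layers of the RGB and motion branches.
    a_rgb = attn_rgb.mean(dim=1)        # average over attention heads
    a_motion = attn_motion.mean(dim=1)
    # Penalize disagreement so motion cues regularize the RGB attention maps.
    return F.mse_loss(a_rgb, a_motion)

def joint_loss(logits, labels, attn_rgb, attn_motion, lambda_attn=0.1):
    # Classification loss plus the attention consistency regularizer,
    # optimized in a single backward pass over the shared network.
    return (F.cross_entropy(logits, labels)
            + lambda_attn * attention_consistency_loss(attn_rgb, attn_motion))
```

Because both branches belong to one jointly trained network, a single backward pass through a loss like this updates them together, which is what distinguishes the approach from ensembling two separately trained per-modality models.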
Related papers
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
- DeblurDiNAT: A Generalizable Transformer for Perceptual Image Deblurring [1.5124439914522694]
DeblurDiNAT is a generalizable and efficient encoder-decoder Transformer which restores clean images visually close to the ground truth.
We present a linear feed-forward network and a non-linear dual-stage feature fusion module for faster feature propagation across the network.
arXiv Detail & Related papers (2024-03-19T21:31:31Z)
- Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, Dual Aggregation Transformer, for image SR.
Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner.
Our experiments show that our DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose SCTNet, a novel alignment-free network with a Semantics Consistent Transformer incorporating both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection [144.66411561224507]
We present a convolutional neural network (CNN) model, named CIR-Net, based on the novel cross-modality interaction and refinement.
Our network outperforms the state-of-the-art saliency detectors both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-10-06T11:59:19Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks; the visible and thermal filters are then used to perform dynamic convolution on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose a joint local and global search that exploits a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- SFANet: A Spectrum-aware Feature Augmentation Network for Visible-Infrared Person Re-Identification [12.566284647658053]
We propose a novel spectrum-aware feature augmentation network named SFANet for the cross-modality matching problem.
Learning with grayscale-spectrum images, our model can markedly reduce modality discrepancy and detect inner structure relations.
At the feature level, we improve the conventional two-stream network by balancing the number of specific and sharable convolutional blocks.
arXiv Detail & Related papers (2021-02-24T08:57:32Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of a ConvLSTM with a depthwise separable convolution (a minimal cell sketch appears after this list).
Our model outperforms prior methods on the larger and more challenging RWF-2000 dataset by a margin of more than 2% in accuracy.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- Adaptive Weighted Attention Network with Camera Spectral Sensitivity Prior for Spectral Reconstruction from RGB Images [22.26917280683572]
We propose a novel adaptive weighted attention network (AWAN) for spectral reconstruction.
Adaptive weighted channel attention (AWCA) and patch-level second-order non-local (PSNL) modules are developed to reallocate channel-wise feature responses.
In the NTIRE 2020 Spectral Reconstruction Challenge, our entries win 1st place on the Clean track and 3rd place on the Real World track.
arXiv Detail & Related papers (2020-05-19T09:21:01Z)
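The SepConvLSTM construction cited in the violence-detection entry above admits a short sketch. Below is a minimal cell assuming all four gate pre-activations come from a single depthwise separable convolution over the concatenated input and hidden state; the published architecture may differ in details such as normalization or peephole connections:

```python
# Hypothetical sketch of a SepConvLSTM cell: each ConvLSTM gate convolution
# is replaced by a depthwise separable convolution (depthwise k x k followed
# by a 1x1 pointwise convolution). Details beyond that are assumptions.
import torch
import torch.nn as nn

class SepConv2d(nn.Module):
    """Depthwise separable convolution: depthwise pass, then pointwise."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

class SepConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int):
        super().__init__()
        # One separable convolution produces all four gate pre-activations.
        self.gates = SepConv2d(in_ch + hidden_ch, 4 * hidden_ch)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, each (B, hidden_ch, H, W)
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # standard LSTM cell update
        h = o * torch.tanh(c)
        return h, (h, c)
```

For a k x k kernel, the depthwise-plus-pointwise split reduces each gate's parameter count from roughly in_ch * out_ch * k^2 to in_ch * k^2 + in_ch * out_ch, which is the source of the efficiency claim.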