PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress
Classification
- URL: http://arxiv.org/abs/2209.10074v1
- Date: Wed, 21 Sep 2022 02:33:49 GMT
- Title: PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress
Classification
- Authors: Wenhao Tang and Sheng Huang and Xiaoxian Zhang and Luwen Huangfu
- Abstract summary: We present a vision Transformer named Pavement Image
Classification Transformer (PicT) for pavement distress classification. PicT
outperforms the second-best performing model by a large margin.
- Score: 10.826472503315912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic pavement distress classification facilitates improving the
efficiency of pavement maintenance and reducing the cost of labor and
resources. A recently influential branch of this task divides the pavement
image into patches and addresses these issues from the perspective of
multi-instance learning. However, these methods neglect the correlation between
patches and suffer from low efficiency in model optimization and inference.
Meanwhile, the Swin Transformer is able to address both of these issues
with its unique strengths. Built upon Swin Transformer, we present a vision
Transformer named \textbf{P}avement \textbf{I}mage \textbf{C}lassification
\textbf{T}ransformer (\textbf{PicT}) for pavement distress classification. In
order to better exploit the discriminative information of pavement images at
the patch level, the \textit{Patch Labeling Teacher} is proposed to leverage a
teacher model to dynamically generate pseudo labels of patches from image
labels during each iteration, and to guide the model to learn the discriminative
features of patches. The broad classification head of Swin Transformer may
dilute the discriminative features of distressed patches in the feature
aggregation step due to the small distressed area ratio of the pavement image.
To overcome this drawback, we present a \textit{Patch Refiner} to cluster
patches into different groups and only select the highest distress-risk group
to yield a slim head for the final image classification. We evaluate our method
on CQU-BPDD. Extensive results show that \textbf{PicT} outperforms the
second-best performing model by large margins of $+2.4\%$ in P@R on the
detection task and $+3.9\%$ in $F1$ on the recognition task, while achieving
1.8x higher throughput and 7x faster training speed using the same computing
resources. Our codes and
models have been released on
\href{https://github.com/DearCaat/PicT}{https://github.com/DearCaat/PicT}.
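The two mechanisms in the abstract can be made concrete with a minimal sketch. All function names, the thresholding rule, and the toy 1-D clustering below are illustrative assumptions, not the paper's actual implementation: a teacher derives patch pseudo labels from a single image label (Patch Labeling Teacher), and a refiner clusters patches and keeps only the highest-risk group before feature aggregation (Patch Refiner).

```python
import numpy as np

def patch_pseudo_labels(distress_scores, image_label, threshold=0.5):
    """Hypothetical sketch of the Patch Labeling Teacher idea: derive
    patch-level pseudo labels from the image label plus the teacher's
    per-patch distress scores (the rule here is an assumption)."""
    scores = np.asarray(distress_scores, dtype=float)
    if image_label == 0:
        # A normal image implies every patch is normal.
        return np.zeros(len(scores), dtype=int)
    # A distressed image labels only confidently distressed patches positive.
    return np.where(scores > threshold, image_label, 0)

def patch_refiner(patch_features, distress_scores, k=2):
    """Hypothetical sketch of the Patch Refiner idea: cluster patches by
    distress score into k groups and aggregate only the highest-risk group,
    yielding a slim input for the final classification head."""
    scores = np.asarray(distress_scores, dtype=float)
    feats = np.asarray(patch_features, dtype=float)
    # Toy 1-D k-means on the scores (the paper's clustering may differ).
    centers = np.linspace(scores.min(), scores.max(), k)
    for _ in range(10):
        assign = np.argmin(np.abs(scores[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = scores[assign == c].mean()
    risky = assign == np.argmax(centers)   # highest distress-risk cluster
    return feats[risky].mean(axis=0)       # aggregate only risky patches
```

In this sketch, average-pooling only the risky cluster avoids diluting distressed-patch features with the many normal patches of a pavement image, which is the motivation the abstract gives for the slim head.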
Related papers
- Next Patch Prediction for Autoregressive Visual Generation [58.73461205369825]
We extend the Next Token Prediction (NTP) paradigm to a novel Next Patch Prediction (NPP) paradigm.
Our key idea is to group and aggregate image tokens into patch tokens with higher information density.
We show that NPP could reduce the training cost to around 0.6 times while improving image generation quality by up to 1.0 FID score on the ImageNet 256x256 generation benchmark.
arXiv Detail & Related papers (2024-12-19T18:59:36Z)
- Semi-supervised 3D Object Detection with PatchTeacher and PillarMix [71.4908268136439]
Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student.
We propose PatchTeacher, which focuses on partial scene 3D object detection to provide high-quality pseudo labels for the student.
We introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher.
arXiv Detail & Related papers (2024-07-13T06:58:49Z)
- Learning to Rank Patches for Unbiased Image Redundancy Reduction [80.93989115541966]
Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated.
Existing approaches strive to overcome this limitation by reducing less meaningful image regions.
We propose a self-supervised framework for image redundancy reduction called Learning to Rank Patches.
arXiv Detail & Related papers (2024-03-31T13:12:41Z)
- Augmenting Prototype Network with TransMix for Few-shot Hyperspectral Image Classification [9.479240476603353]
We propose to augment the prototype network with TransMix for few-shot hyperspectral image classification (APNT).
While taking the prototype network as the backbone, it adopts a transformer as the feature extractor to learn pixel-to-pixel relations.
The proposed method demonstrates state-of-the-art performance and better robustness for few-shot hyperspectral image classification.
arXiv Detail & Related papers (2024-01-22T06:56:52Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers [0.11219061154635457]
Whole-Slide Imaging allows for the capturing and digitization of high-resolution images of histological specimens.
The transformer architecture has been proposed as a possible candidate for effectively leveraging this high-resolution information.
We propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches.
arXiv Detail & Related papers (2023-05-11T16:42:24Z)
- DBAT: Dynamic Backward Attention Transformer for Material Segmentation with Cross-Resolution Patches [8.812837829361923]
We propose the Dynamic Backward Attention Transformer (DBAT) to aggregate cross-resolution features.
Experiments show that our DBAT achieves an accuracy of 86.85%, which is the best performance among state-of-the-art real-time models.
We further align features with semantic labels via network dissection, showing that the proposed model extracts material-related features better than other methods.
arXiv Detail & Related papers (2023-05-06T03:47:20Z)
- PATS: Patch Area Transportation with Subdivision for Local Feature Matching [78.67559513308787]
Local feature matching aims at establishing sparse correspondences between a pair of images.
We propose Patch Area Transportation with Subdivision (PATS) to tackle this issue.
PATS improves both matching accuracy and coverage, and shows superior performance in downstream tasks.
arXiv Detail & Related papers (2023-03-14T08:28:36Z)
- Weakly Supervised Patch Label Inference Networks for Efficient Pavement Distress Detection and Recognition in the Wild [14.16549562799135]
We present Weakly Supervised Patch Label Inference Networks (WSPLIN) for efficiently addressing pavement image classification tasks.
WSPLIN transforms the fully supervised pavement image classification problem into a weakly supervised pavement patch classification problem.
We evaluate our method on a large-scale bituminous pavement distress dataset.
arXiv Detail & Related papers (2022-03-31T04:01:02Z)
- HIPA: Hierarchical Patch Transformer for Single Image Super Resolution [62.7081074931892]
This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition.
We build a cascaded model that processes an input image in multiple stages, where we start with tokens with small patch sizes and gradually merge to the full resolution.
Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions.
arXiv Detail & Related papers (2022-03-19T05:09:34Z)
- Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform [58.60004238261117]
We propose a versatile deep image compression network based on Spatial Feature Transform (SFT, arXiv:1804.02815).
Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps.
The proposed framework allows us to perform task-aware image compressions for various tasks.
arXiv Detail & Related papers (2021-08-21T17:30:06Z)
- Mixed Supervision Learning for Whole Slide Image Classification [88.31842052998319]
We propose a mixed supervision learning framework for super high-resolution images.
During the patch training stage, this framework can make use of coarse image-level labels to refine self-supervised learning.
A comprehensive strategy is proposed to suppress pixel-level false positives and false negatives.
arXiv Detail & Related papers (2021-07-02T09:46:06Z)
- A Hierarchical Transformation-Discriminating Generative Model for Few Shot Anomaly Detection [93.38607559281601]
We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image.
The anomaly score is obtained by aggregating the patch-based votes of the correct transformation across scales and image regions.
arXiv Detail & Related papers (2021-04-29T17:49:48Z)
- An Iteratively Optimized Patch Label Inference Network for Automatic Pavement Distress Detection [12.89160593375335]
We present a novel deep learning framework named the Iteratively optimized Patch Label Inference Network (IOPLIN) for automatically detecting various pavement distresses.
IOPLIN can be iteratively trained with only image labels via the Expectation-Maximization Inspired Patch Label Distillation strategy.
It can handle images of different resolutions and make full use of image information, particularly for high-resolution images.
arXiv Detail & Related papers (2020-05-27T11:56:38Z)
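The Expectation-Maximization-inspired training loop that the IOPLIN summary describes can be sketched as follows. This is an illustrative toy, not IOPLIN's actual model: the function name, the linear patch scorer, and the top-patch labeling rule are all placeholder assumptions, showing only the alternation between inferring patch pseudo labels from an image label (E-step) and updating the model on them (M-step).

```python
import numpy as np

def em_patch_label_distillation(patches, image_label, w, iters=3, lr=0.5):
    """Hedged sketch of EM-inspired patch label distillation: alternate
    between inferring patch pseudo labels from the image label (E-step)
    and nudging a toy linear patch scorer toward them (M-step)."""
    X = np.asarray(patches, dtype=float)          # (num_patches, dim)
    w = np.asarray(w, dtype=float)
    for _ in range(iters):
        scores = 1 / (1 + np.exp(-X @ w))         # per-patch distress prob
        if image_label == 0:
            pseudo = np.zeros(len(X))             # normal image: all normal
        else:
            # distressed image: at least the top-scoring patch is positive
            pseudo = (scores >= scores.max()).astype(float)
        grad = X.T @ (scores - pseudo) / len(X)   # logistic-regression grad
        w = w - lr * grad                         # M-step update
    return w
```

Each iteration's pseudo labels come only from the image label and the current scores, which is what lets such a model train without any patch-level annotation.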
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.