Related papers: Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

URL: http://arxiv.org/abs/2407.07842v1
Date: Wed, 10 Jul 2024 17:02:42 GMT
Title: Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification
Authors: Mei Qiu, Lauren Christopher, Lingxi Li
Abstract summary: We propose a novel ViT-based ReID framework, which fuses models trained on a variety of aspect ratios. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0% compared to the closest state-of-the-art (CAL) result of 80.9% on the VehicleID dataset.
Score: 4.189040854337193
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video input might significantly affect re-identification performance. To address this issue, we propose a novel ViT-based ReID framework that fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze aspect ratio performance on the VeRi-776 and VehicleID datasets, guiding input settings based on the aspect ratios of the original images. (ii) We introduce intra-image patch-wise mixup during ViT patchification (guided by spatial attention scores) and implement uneven stride for better object aspect ratio matching. (iii) We propose a dynamic feature fusion ReID network, enhancing model robustness. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0% compared to the closest state-of-the-art (CAL) result of 80.9% on the VehicleID dataset.
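
As a rough illustration of the uneven-stride idea in contribution (ii), the sketch below counts the patches produced by a sliding-window patch embedding when the vertical and horizontal strides differ. The function name `patch_grid`, the 224x224 input, the 16-pixel patch, and the stride values are assumptions chosen for illustration; the abstract does not state the settings used in the paper.

```python
def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols) of patches for a sliding-window patch embedding.

    All default values are illustrative assumptions, not the paper's settings.
    """
    if height < patch or width < patch:
        raise ValueError("image must be at least as large as one patch")
    # Standard sliding-window count along each axis (no padding).
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


def token_count(height, width, patch=16, stride_h=16, stride_w=16):
    """Total number of patch tokens fed to the transformer (excluding any class token)."""
    rows, cols = patch_grid(height, width, patch, stride_h, stride_w)
    return rows * cols


if __name__ == "__main__":
    # Even stride: non-overlapping 16x16 patches on a square input.
    print(patch_grid(224, 224))
    # Uneven stride: denser sampling along the vertical axis only.
    print(patch_grid(224, 224, stride_h=12, stride_w=16))
```

Under these assumed values, lowering only the vertical stride from 16 to 12 changes the patch grid from 14x14 to 18x14, so the token grid's aspect ratio can be adjusted without resizing the input image.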

Related papers

Efficient Scaling of Diffusion Transformers for Text-to-Image Generation [105.7324182618969]
We study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations. We find that U-ViT, a pure self-attention based DiT model provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants.
arXiv Detail & Related papers (2024-12-16T22:59:26Z)
Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID [3.834614490767914]
Non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. We propose a novel ViT-based ReID framework that fuses models trained on various aspect ratios. Our method outperforms state-of-the-art transformer-based approaches on both datasets.
arXiv Detail & Related papers (2024-11-09T21:49:45Z)
Optimization of Autonomous Driving Image Detection Based on RFAConv and Triplet Attention [1.345669927504424]
This paper proposes a holistic approach to enhance the YOLOv8 model. The C2f_RFAConv module replaces the original module to improve feature extraction efficiency. The Triplet Attention mechanism sharpens feature focus for better target detection.
arXiv Detail & Related papers (2024-06-25T08:59:33Z)
ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [8.372189962601077]
The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers. We propose a novel residual attention learning method for improving ViT-based architectures.
arXiv Detail & Related papers (2024-02-17T14:44:10Z)
Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by quantitative metrics. By comparing the transformer features of the recovered image and the target image, the pretrained transformer provides high-resolution blur-sensitive semantic information. One approach regards the features as vectors and computes the discrepancy between the representations extracted from the recovered image and the target image in Euclidean space.
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
Optimizing Relevance Maps of Vision Transformers Improves Robustness [91.61353418331244]
It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. We propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks.
arXiv Detail & Related papers (2022-06-02T17:24:48Z)
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context. We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
Parsing-based View-aware Embedding Network for Vehicle Re-Identification [138.11983486734576]
We propose a parsing-based view-aware embedding network (PVEN) to achieve the view-aware feature alignment and enhancement for vehicle ReID. The experiments conducted on three datasets show that our model outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-04-10T13:06:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.