Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
- URL: http://arxiv.org/abs/2411.06297v1
- Date: Sat, 09 Nov 2024 21:49:45 GMT
- Title: Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
- Authors: Mei Qiu, Lauren Ann Christopher, Stanley Chien, Lingxi Li
- Abstract summary: Non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy.
We propose a novel ViT-based ReID framework that fuses models trained on various aspect ratios.
Our method outperforms state-of-the-art transformer-based approaches on both datasets.
- Score: 3.834614490767914
- Abstract: Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human-perception-driven, and general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce a patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.
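The abstract's patchification ideas can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch rendering of contribution (ii): a convolutional patch embedding whose vertical and horizontal strides differ (uneven stride), plus a patch-wise mixup that blends the patch tokens of two images using spatial attention scores. The class and function names, the stride values, and the sigmoid mixing rule are assumptions made for illustration; the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn

class UnevenStridePatchEmbed(nn.Module):
    """Conv-based patchify whose vertical/horizontal strides may differ.

    A stride smaller than the patch size along one axis yields overlapping
    patches along that axis, compensating for a stretched aspect ratio.
    """
    def __init__(self, patch=16, stride_hw=(16, 12), dim=768, in_ch=3):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=stride_hw)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H', W')
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens

def patch_mixup(tokens_a, tokens_b, attn_a, attn_b, alpha=5.0):
    """Mix the patch tokens of two images, patch by patch.

    attn_a / attn_b: (B, N) spatial attention scores (e.g. CLS-to-patch
    attention). Patches where image A dominates in attention keep more of A;
    this sigmoid mixing rule is an assumed stand-in for the paper's.
    """
    w_a = torch.sigmoid(alpha * (attn_a - attn_b)).unsqueeze(-1)  # (B, N, 1)
    return w_a * tokens_a + (1.0 - w_a) * tokens_b

# Usage: embed two same-size views, mix, then feed the result to a ViT encoder.
embed = UnevenStridePatchEmbed()
xa, xb = torch.randn(2, 3, 256, 128), torch.randn(2, 3, 256, 128)
ta, tb = embed(xa), embed(xb)                  # each (2, 160, 768)
mixed = patch_mixup(ta, tb, torch.rand(2, 160), torch.rand(2, 160))
```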
Related papers
- UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching [18.02254687807291]
UniTT-Stereo is a method to maximize the potential of Transformer-based stereo architectures.
State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets.
arXiv Detail & Related papers (2024-09-04T09:02:01Z)
- Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification [4.189040854337193]
We propose a novel ViT-based ReID framework that fuses models trained on a variety of aspect ratios (an illustrative fusion sketch appears after this list).
Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0%, compared to the closest state-of-the-art (CAL) result of 80.9%, on the VehicleID dataset.
arXiv Detail & Related papers (2024-07-10T17:02:42Z)
- V2X-AHD: Vehicle-to-Everything Cooperation Perception via Asymmetric Heterogenous Distillation Network [13.248981195106069]
We propose a multi-view vehicle-road cooperation perception system, vehicle-to-everything cooperative perception (V2X-AHD).
According to this study, V2X-AHD effectively improves the accuracy of 3D object detection while reducing the number of network parameters.
arXiv Detail & Related papers (2023-10-10T13:12:03Z)
- Unifying Flow, Stereo and Depth Estimation [121.54066319299261]
We present a unified formulation and model for three motion and 3D perception tasks.
We formulate all three tasks as a unified dense correspondence matching problem.
Our model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks.
arXiv Detail & Related papers (2022-11-10T18:59:54Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Optimizing Relevance Maps of Vision Transformers Improves Robustness [91.61353418331244]
It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes.
We propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object.
This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks.
arXiv Detail & Related papers (2022-06-02T17:24:48Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Parsing-based View-aware Embedding Network for Vehicle Re-Identification [138.11983486734576]
We propose a parsing-based view-aware embedding network (PVEN) to achieve the view-aware feature alignment and enhancement for vehicle ReID.
The experiments conducted on three datasets show that our model outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-04-10T13:06:09Z)
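To make the fusion idea concrete, contribution (iii) of the main abstract, echoed by the aspect-ratio study above, can be sketched as a small gating network that blends embeddings from expert models trained at different aspect ratios. The gate architecture, its aspect-ratio input, and all names below are assumptions made for illustration, not a published interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFusion(nn.Module):
    """Fuse per-model ReID embeddings with a gate on the input's aspect ratio."""
    def __init__(self, num_models=3):
        super().__init__()
        # Tiny MLP mapping an image's aspect ratio to one weight per expert.
        self.gate = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                                  nn.Linear(32, num_models))

    def forward(self, feats, aspect_ratio):
        # feats: (B, num_models, dim) embeddings from aspect-ratio experts
        # aspect_ratio: (B, 1), width / height of each original image
        w = torch.softmax(self.gate(aspect_ratio), dim=-1)  # (B, num_models)
        fused = (w.unsqueeze(-1) * feats).sum(dim=1)        # (B, dim)
        return F.normalize(fused, dim=-1)                   # cosine-ready

# Usage: three frozen experts (e.g. trained on 1:1, 2:1, and 3:1 inputs)
# each produce a 768-d embedding; the gate decides how to blend them.
fusion = DynamicFusion(num_models=3)
feats = torch.randn(4, 3, 768)                  # 4 query images x 3 experts
ar = torch.tensor([[2.0], [1.3], [0.9], [1.8]]) # width / height per image
print(fusion(feats, ar).shape)                  # torch.Size([4, 768])
```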
This list is automatically generated from the titles and abstracts of the papers on this site.