CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification
- URL: http://arxiv.org/abs/2502.16815v1
- Date: Mon, 24 Feb 2025 03:52:37 GMT
- Title: CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification
- Authors: Liping Lu, Zihao Fu, Duanfeng Chu, Wei Wang, Bingrong Xu
- Abstract summary: We propose a CLIP-based Semantic Enhancement Network (CLIP-SENet) to enhance vehicle Re-ID. CLIP-SENet is an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
- Score: 11.817329389930489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vehicle re-identification (Re-ID) is a crucial task in intelligent transportation systems (ITS), aimed at retrieving and matching the same vehicle across different surveillance cameras. Numerous studies have explored methods to enhance vehicle Re-ID through semantic enhancement. However, these methods often rely on additional annotated information to enable models to extract effective semantic features, which introduces significant limitations. In this work, we propose a CLIP-based Semantic Enhancement Network (CLIP-SENet), an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes, facilitating the generation of more robust semantic feature representations. Inspired by the zero-shot capabilities that large-scale vision-language models offer for downstream tasks, we leverage the powerful cross-modal descriptive capabilities of the CLIP image encoder to initially extract general semantic information. Instead of using a text encoder for semantic alignment, we design an adaptive fine-grained enhancement module (AFEM) that adaptively refines this general semantic information at a fine-grained level to obtain robust semantic feature representations. These features are then fused with conventional Re-ID appearance features to further sharpen the distinctions between vehicles. Our comprehensive evaluation on three benchmark datasets demonstrates the effectiveness of CLIP-SENet. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
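To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a frozen CLIP image encoder supplies general semantic features, an AFEM refines them, and the result is fused with conventional appearance features. The AFEM internals (modeled here as channel-wise gating), the dimensions, and the fusion layer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AFEM(nn.Module):
    """Hypothetical adaptive fine-grained enhancement module: re-weights
    CLIP semantic channels with a learned gate (an assumption; the
    paper's AFEM internals are not given in the abstract)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid())

    def forward(self, sem):              # sem: (B, dim)
        return sem * self.gate(sem)      # fine-grained channel re-weighting

class CLIPSENetSketch(nn.Module):
    def __init__(self, clip_image_encoder, appearance_backbone,
                 sem_dim=512, app_dim=2048, out_dim=1024):
        super().__init__()
        self.clip_enc = clip_image_encoder    # frozen CLIP image encoder
        self.backbone = appearance_backbone   # e.g. a ResNet Re-ID branch
        self.afem = AFEM(sem_dim)
        self.fuse = nn.Linear(sem_dim + app_dim, out_dim)

    def forward(self, images):
        with torch.no_grad():                 # no text encoder is used
            sem = self.clip_enc(images)       # general semantics (B, sem_dim)
        sem = self.afem(sem)                  # refined semantic features
        app = self.backbone(images)           # appearance features (B, app_dim)
        return self.fuse(torch.cat([sem, app], dim=1))
```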
Related papers
- DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps [30.53564087005569]
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs). Due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation. We propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices.
arXiv Detail & Related papers (2025-02-21T19:06:01Z) - Object Re-identification via Spatial-temporal Fusion Networks and Causal Identity Matching [4.123763595394021]
We introduce a novel ReID framework that leverages a spatial-temporal fusion network and causal identity matching (CIM).
Our framework estimates camera network topology using a proposed adaptive Parzen window and combines appearance features with spatial-temporal cues within the fusion network.
This approach has demonstrated outstanding performance across several datasets, including VeRi776, Vehicle-3I, and Market-1501, achieving up to 99.70% rank-1 accuracy and 95.5% mAP.
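The "adaptive Parzen window" suggests a kernel density estimate over inter-camera transit times that serves as a spatial-temporal prior on candidate matches. The sketch below illustrates that idea under stated assumptions; the Gaussian kernel, Silverman-style bandwidth, and multiplicative fusion with appearance similarity are stand-ins, not the paper's actual formulation.

```python
import numpy as np

def parzen_density(train_dt, query_dt, bandwidth=None):
    """Gaussian Parzen-window estimate of p(transit time) for one
    camera pair, fitted to previously observed transit times."""
    train_dt = np.asarray(train_dt, dtype=float)
    query_dt = np.atleast_1d(np.asarray(query_dt, dtype=float))
    if bandwidth is None:  # Silverman's rule as a stand-in "adaptive" choice
        bandwidth = max(1.06 * train_dt.std() * len(train_dt) ** -0.2, 1e-6)
    z = (query_dt[None, :] - train_dt[:, None]) / bandwidth
    return np.exp(-0.5 * z ** 2).mean(axis=0) / (bandwidth * np.sqrt(2 * np.pi))

def fused_score(appearance_sim, time_gaps, transit_history):
    """Down-weight gallery candidates whose time gap is implausible
    for this camera pair (multiplicative fusion is an assumption)."""
    prior = parzen_density(transit_history, time_gaps)
    return appearance_sim * prior / (prior.max() + 1e-12)
```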
arXiv Detail & Related papers (2024-08-10T13:50:43Z) - VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification [27.075761782915496]
This paper proposes to synthesize a large number of vehicle images in the target pose.
Considering that paired data of the same vehicles across different traffic surveillance cameras might not be available in the real world, we propose VehicleGAN.
Because of the feature distribution difference between real and synthetic data, we propose a new Joint Metric Learning (JML) via effective feature-level fusion.
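As a rough illustration of joint metric learning with feature-level fusion, the sketch below embeds a real image and its synthesized counterpart, fuses the two embeddings, and applies a batch-hard triplet loss. The shared encoder, mean-style fusion, and loss choice are assumptions; the paper's actual JML is not reproduced here.

```python
import torch
import torch.nn.functional as F

def jml_step(encoder, real_imgs, synth_imgs, labels, margin=0.3):
    """One training step: fuse real/synthetic embeddings of the same
    vehicles, then apply a batch-hard triplet loss on the fused features."""
    f_real = F.normalize(encoder(real_imgs), dim=1)
    f_synth = F.normalize(encoder(synth_imgs), dim=1)
    fused = F.normalize(f_real + f_synth, dim=1)    # feature-level fusion
    dist = torch.cdist(fused, fused)                # pairwise distances
    same = labels[:, None] == labels[None, :]
    hard_pos = (dist * same.float()).max(dim=1).values
    hard_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hard_pos - hard_neg + margin).mean()
```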
arXiv Detail & Related papers (2023-11-27T19:34:04Z) - VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification [51.89551385538251]
We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos.
VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features.
Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space.
arXiv Detail & Related papers (2023-11-27T19:30:30Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
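A lightweight cross-modal projection of this kind can be sketched as a small MLP that maps the referring expression's text embedding into prompt embeddings for the segmentation model. The dimensions and the sparse/dense split below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Projects a text embedding into prompt tokens (sketch; sizes and
    the sparse/dense split are assumptions, not RefSAM's exact design)."""
    def __init__(self, text_dim=768, prompt_dim=256, n_sparse=2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim * (n_sparse + 1)))
        self.n_sparse = n_sparse
        self.prompt_dim = prompt_dim

    def forward(self, text_emb):             # text_emb: (B, text_dim)
        out = self.proj(text_emb).view(-1, self.n_sparse + 1, self.prompt_dim)
        sparse = out[:, :self.n_sparse]       # point/box-like prompt tokens
        dense = out[:, self.n_sparse:]        # mask-like dense prompt seed
        return sparse, dense
```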
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - A High-Accuracy Unsupervised Person Re-identification Method Using Auxiliary Information Mined from Datasets [53.047542904329866]
We make use of auxiliary information mined from datasets for multi-modal feature learning.
This paper proposes three effective training tricks: Restricted Label Smoothing Cross Entropy Loss (RLSCE), Weight Adaptive Triplet Loss (WATL), and Dynamic Training Iterations (DTI).
arXiv Detail & Related papers (2022-05-06T10:16:18Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
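A minimal sketch of neural message passing for data association: edges between track and detection nodes aggregate neighbor information, and an edge classifier scores each candidate match. Layer sizes and the residual node update are assumptions; the paper's exact graph architecture is not reproduced.

```python
import torch
import torch.nn as nn

class EdgeMessagePassing(nn.Module):
    """One round of message passing over a tracking graph (sketch)."""
    def __init__(self, node_dim=64, edge_dim=32):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, edge_dim), nn.ReLU())
        self.node_mlp = nn.Sequential(
            nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())
        self.classifier = nn.Linear(edge_dim, 1)

    def forward(self, nodes, edges, src, dst):
        # nodes: (N, node_dim); edges: (E, edge_dim)
        # src/dst: (E,) long tensors with endpoint indices of each edge
        e = self.edge_mlp(torch.cat([nodes[src], nodes[dst], edges], dim=1))
        msgs = self.node_mlp(torch.cat([nodes[src], e], dim=1))
        agg = torch.zeros_like(nodes).index_add_(0, dst, msgs)
        return nodes + agg, torch.sigmoid(self.classifier(e))  # match prob.
```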
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - A Strong Baseline for Vehicle Re-Identification [1.9573380763700712]
Vehicle Re-ID aims to identify the same vehicle across different cameras.
In this paper, we first analyze the main factors hindering the Vehicle Re-ID performance.
We then present our solutions, specifically targeting Track 2 of the 5th AI City Challenge.
arXiv Detail & Related papers (2021-04-22T03:54:55Z) - Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
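The core idea, perturbing intermediate feature embeddings along the loss gradient and training on both clean and perturbed copies, can be sketched as follows. The single FGSM-style step and the epsilon value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def adversarial_feature_step(backbone, head, images, labels, eps=0.1):
    """Perturb intermediate embeddings along the loss gradient and
    train on clean + perturbed copies (feature-space FGSM sketch)."""
    feats = backbone(images)                         # intermediate embeddings
    feats_adv = feats.detach().clone().requires_grad_(True)
    grad, = torch.autograd.grad(
        F.cross_entropy(head(feats_adv), labels), feats_adv)
    feats_adv = feats_adv.detach() + eps * grad.sign()
    return (F.cross_entropy(head(feats), labels)
            + F.cross_entropy(head(feats_adv), labels))
```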
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - AttributeNet: Attribute Enhanced Vehicle Re-Identification [70.89289512099242]
We introduce AttributeNet (ANet) that jointly extracts identity-relevant features and attribute features.
We enable interaction between identity and attribute features by distilling the ReID-helpful attribute feature and adding it into the general ReID feature to increase its discriminative power.
We validate the effectiveness of our framework on three challenging datasets.
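The additive fusion described above can be sketched as a small two-branch head: an attribute branch produces a ReID-helpful feature that is added back into the general ReID feature before the identity classifier. Branch shapes and head sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttributeFusedReID(nn.Module):
    """Two-branch head sketch: attribute feature added into ReID feature."""
    def __init__(self, backbone, feat_dim=2048, n_attrs=10, n_ids=576):
        super().__init__()
        self.backbone = backbone
        self.attr_branch = nn.Linear(feat_dim, feat_dim)  # distilled attrs
        self.attr_head = nn.Linear(feat_dim, n_attrs)     # attribute logits
        self.id_head = nn.Linear(feat_dim, n_ids)         # identity logits

    def forward(self, images):
        base = self.backbone(images)                 # general ReID feature
        attr_feat = torch.relu(self.attr_branch(base))
        fused = base + attr_feat                     # additive enhancement
        return self.id_head(fused), self.attr_head(attr_feat)
```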
arXiv Detail & Related papers (2021-02-07T19:51:02Z) - VehicleNet: Learning Robust Visual Representation for Vehicle Re-identification [116.1587709521173]
We propose to build a large-scale vehicle dataset (called VehicleNet) by harnessing four public vehicle datasets.
We design a simple yet effective two-stage progressive approach to learning more robust visual representation from VehicleNet.
We achieve state-of-the-art accuracy of 86.07% mAP on the private test set of the AICity Challenge.
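The two-stage progressive recipe can be sketched as pre-training the backbone on the merged VehicleNet identities and then fine-tuning on the target dataset with a fresh classifier; epoch counts and learning rates below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_stage(backbone, classifier, loader, epochs, lr):
    """Standard ID-classification training loop for one stage."""
    params = list(backbone.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            loss = F.cross_entropy(classifier(backbone(images)), labels)
            opt.zero_grad(); loss.backward(); opt.step()

def two_stage(backbone, feat_dim, vehiclenet_loader, target_loader,
              n_src_ids, n_tgt_ids):
    train_stage(backbone, nn.Linear(feat_dim, n_src_ids),
                vehiclenet_loader, epochs=12, lr=0.01)   # stage 1: merged data
    train_stage(backbone, nn.Linear(feat_dim, n_tgt_ids),
                target_loader, epochs=8, lr=0.001)       # stage 2: target only
```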
arXiv Detail & Related papers (2020-04-14T05:06:38Z) - Attribute-guided Feature Learning Network for Vehicle Re-identification [13.75036137728257]
Vehicle re-identification (reID) plays an important role in the automatic analysis of the increasing urban surveillance videos.
This paper proposes a novel Attribute-Guided Network (AGNet), which learns a global representation enriched with abundant attribute features.
arXiv Detail & Related papers (2020-01-12T06:57:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.