Related papers: Exploring Part-Informed Visual-Language Learning for Person Re-Identification

Exploring Part-Informed Visual-Language Learning for Person Re-Identification

URL: http://arxiv.org/abs/2308.02738v1
Date: Fri, 4 Aug 2023 23:13:49 GMT
Title: Exploring Part-Informed Visual-Language Learning for Person Re-Identification
Authors: Yin Lin, Cong Liu, Yehansen Chen, Jinshui Hu, Bing Yin, Baocai Yin, Zengfu Wang
Abstract summary: We propose to enhance fine-grained visual features with part-informed language supervision for visual-based person re-identification tasks. Our $pi$-VL achieves substantial improvements over previous state-of-the-arts on four common-used ReID benchmarks.
Score: 40.725052076983516
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, visual-language learning has shown great potential in enhancing visual-based person re-identification (ReID). Existing visual-language learning-based ReID methods often focus on whole-body scale image-text feature alignment, while neglecting supervisions on fine-grained part features. This choice simplifies the learning process but cannot guarantee within-part feature semantic consistency thus hindering the final performance. Therefore, we propose to enhance fine-grained visual features with part-informed language supervision for ReID tasks. The proposed method, named Part-Informed Visual-language Learning ($\pi$-VL), suggests that (i) a human parsing-guided prompt tuning strategy and (ii) a hierarchical fusion-based visual-language alignment paradigm play essential roles in ensuring within-part feature semantic consistency. Specifically, we combine both identity labels and parsing maps to constitute pixel-level text prompts and fuse multi-stage visual features with a light-weight auxiliary head to perform fine-grained image-text alignment. As a plug-and-play and inference-free solution, our $\pi$-VL achieves substantial improvements over previous state-of-the-arts on four common-used ReID benchmarks, especially reporting 90.3% Rank-1 and 76.5% mAP for the most challenging MSMT17 database without bells and whistles.

Related papers

SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection [4.013156524547072]
This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection.<n>The proposed framework adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention.<n>Experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-07-27T09:16:39Z)
Semantic-guided Representation Learning for Multi-Label Recognition [13.046479112800608]
Multi-label Recognition (MLR) involves assigning multiple labels to each data instance in an image. Recent Vision and Language Pre-training methods have made significant progress in tackling zero-shot MLR tasks. We introduce a Semantic-guided Representation Learning approach (SigRL) that enables the model to learn effective visual and textual representations.
arXiv Detail & Related papers (2025-04-04T08:15:08Z)
TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce T, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding. We introduce a referring remote sensing image segmentation foundational model, RSRefSeg. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z)
Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained backbone CLIP to learn basic multimodal features. It constructs Text Guided Masked Image Modeling task to enhance the model's ability of learning local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings. Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP) We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images. We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition [47.11517266162346]
We propose a Prompt-driven Visual-Linguistic Representation Learning framework to better leverage the capabilities of the linguistic modality. In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features.
arXiv Detail & Related papers (2024-01-31T14:39:11Z)
ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features. It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set. It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples. Recent methods explore suitable measures for the similarity between the query and support images. We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in Scene Text Recognition (STR) task. Recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as linguistic insensitive drift (LID) problem in this paper. We propose a $textbfL$inguistic $textbfP$erception $textbfV$ision model (LPV) which explores the linguistic capability of vision model for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models. SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels. We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction. We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [119.43299939907685]
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Existing attention-based models have struggled to learn inferior region features in a single image by solely using unidirectional attention. We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations.
arXiv Detail & Related papers (2021-12-16T05:49:51Z)
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union. VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.