Exploring Part-Informed Visual-Language Learning for Person
Re-Identification
- URL: http://arxiv.org/abs/2308.02738v1
- Date: Fri, 4 Aug 2023 23:13:49 GMT
- Title: Exploring Part-Informed Visual-Language Learning for Person
Re-Identification
- Authors: Yin Lin, Cong Liu, Yehansen Chen, Jinshui Hu, Bing Yin, Baocai Yin,
Zengfu Wang
- Abstract summary: We propose to enhance fine-grained visual features with part-informed language supervision for visual-based person re-identification tasks.
Our $\pi$-VL achieves substantial improvements over previous state-of-the-art methods on four commonly used ReID benchmarks.
- Score: 40.725052076983516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, visual-language learning has shown great potential in enhancing
visual-based person re-identification (ReID). Existing visual-language
learning-based ReID methods often focus on whole-body scale image-text feature
alignment, while neglecting supervisions on fine-grained part features. This
choice simplifies the learning process but cannot guarantee within-part feature
semantic consistency thus hindering the final performance. Therefore, we
propose to enhance fine-grained visual features with part-informed language
supervision for ReID tasks. The proposed method, named Part-Informed
Visual-language Learning ($\pi$-VL), suggests that (i) a human parsing-guided
prompt tuning strategy and (ii) a hierarchical fusion-based visual-language
alignment paradigm play essential roles in ensuring within-part feature
semantic consistency. Specifically, we combine both identity labels and parsing
maps to constitute pixel-level text prompts and fuse multi-stage visual
features with a light-weight auxiliary head to perform fine-grained image-text
alignment. As a plug-and-play and inference-free solution, our $\pi$-VL
achieves substantial improvements over previous state-of-the-art methods on four
commonly used ReID benchmarks, especially reporting 90.3% Rank-1 and 76.5% mAP
for the most challenging MSMT17 database without bells and whistles.
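As a rough illustration of the recipe described in the abstract, the following is a hypothetical Python sketch (not the authors' code): part-level text prompts are built by combining an identity label with parsing-map part names, pixel features are average-pooled within each parsing region, and a cross-entropy over part-to-text cosine similarities stands in for the fine-grained image-text alignment objective. The part set, the prompt template, and the exact loss form are all assumptions.

```python
# Hypothetical sketch of part-informed image-text alignment; the part
# names, prompt template, and loss are assumptions, not the paper's code.
import numpy as np

PART_NAMES = ["head", "torso", "arms", "legs", "feet"]  # assumed part set

def build_part_prompts(identity: str) -> list[str]:
    """Combine an identity label with parsing-map part names into
    part-level text prompts (placeholder template)."""
    return [f"a photo of the {part} of person {identity}" for part in PART_NAMES]

def part_pool(feat_map: np.ndarray, parsing: np.ndarray, num_parts: int) -> np.ndarray:
    """Average-pool pixel features within each parsing region.
    feat_map: (H, W, C) visual features; parsing: (H, W) part indices."""
    pooled = np.zeros((num_parts, feat_map.shape[-1]))
    for p in range(num_parts):
        mask = parsing == p
        if mask.any():
            pooled[p] = feat_map[mask].mean(axis=0)
    return pooled

def alignment_loss(part_feats: np.ndarray, text_feats: np.ndarray) -> float:
    """Cross-entropy over part-to-text cosine similarities, so each
    pooled part feature matches its own part prompt (a generic stand-in
    for the paper's fine-grained image-text alignment objective)."""
    a = part_feats / np.linalg.norm(part_feats, axis=1, keepdims=True)
    b = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = a @ b.T  # (P, P) part-to-prompt similarity matrix
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(logp).mean())
```

Because the auxiliary head and loss are only used during training, such a scheme adds no inference cost, which is consistent with the "plug-and-play and inference-free" claim above.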
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Language-Inspired Relation Transfer for Few-shot Class-Incremental Learning [42.923762020491495]
We propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions.
Our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of the mini-ImageNet and CIFAR-100 FSCIL benchmarks.
arXiv Detail & Related papers (2025-01-10T10:59:27Z)
- Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model.
It introduces a pre-trained CLIP backbone to learn basic multimodal features.
It constructs a Text-Guided Masked Image Modeling task to enhance the model's ability to learn local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pre-trained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose a straightforward solution: leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which injects semantic information into the visual prompt to distill a semantic-enhanced prompt for visual representation enrichment.
AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.
Recent vision models suffer from attention drift: the pure vision-based query usually causes poor recognition, which this paper summarizes as the linguistic insensitive drift (LID) problem.
We propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of a vision model for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
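The dual-encoder contrastive alignment described in the last entry can be sketched as a symmetric InfoNCE loss over an image-text similarity matrix: matched pairs (the diagonal) should score higher than every mismatched pair in the batch. This is a minimal, generic illustration, assuming L2-normalized embeddings and a fixed temperature; the encoders themselves are omitted, and the hyperparameters are not from the paper.

```python
# Minimal sketch of a dual-encoder contrastive (InfoNCE) objective,
# not the paper's implementation; encoders are assumed to have already
# produced one embedding per image and per text.
import numpy as np

def _ce_diag(logits: np.ndarray) -> float:
    """Mean cross-entropy with the diagonal as the target class."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(logp).mean())

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: pull matched image-text pairs together and
    push mismatched in-batch pairs apart, in both directions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    return (_ce_diag(sim) + _ce_diag(sim.T)) / 2
```

With perfectly matched embeddings the loss approaches zero; with random pairings it stays near the log of the batch size, which is why scale (here, a billion noisy pairs) can compensate for label noise under this objective.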
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.