UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
- URL: http://arxiv.org/abs/2504.10084v1
- Date: Mon, 14 Apr 2025 10:40:54 GMT
- Title: UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
- Authors: Yating Liu, Yaowei Li, Xiangyuan Lan, Wenming Yang, Zimo Liu, Qingmin Liao
- Abstract summary: Text-based Person Retrieval (TPR), a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network. We propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP.
- Score: 47.018491164452094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based Person Retrieval (TPR), a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive vision-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which has shown notable performance improvements over uni-modal pre-training models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components: Prefix, LoRA, and Adapter. Prefix and LoRA are devised together to mine local information with task-specific prompts, while Adapter adjusts global feature representations. Additionally, two vanilla submodules are optimized to fit the unified TPR architecture. First, S-Prefix boosts the attention on prefix tokens and enhances their gradient propagation, improving the flexibility and performance of the vanilla prefix. Second, L-Adapter is placed in parallel with layer normalization to adjust the overall feature distribution, resolving conflicts caused by overlap and interaction among the multiple submodules. Extensive experiments demonstrate that UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
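The abstract pins down where each component sits (Prefix and LoRA inside attention, L-Adapter in parallel with layer normalization) but not how they compose. The following is a minimal PyTorch sketch of one encoder block combining the three PETL components in that spirit; all class names, ranks, dimensions, and wiring details are illustrative assumptions rather than the released UP-Person code (see the linked repository for the real implementation), and the paper's S-Prefix attention boosting is omitted.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, dim: int, rank: int = 4, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():      # pre-trained weights stay frozen
            p.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)  # low-rank A
        self.up = nn.Linear(rank, dim, bias=False)    # low-rank B
        nn.init.zeros_(self.up.weight)        # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


class LAdapter(nn.Module):
    """Bottleneck adapter wired in parallel with a LayerNorm, echoing the
    paper's L-Adapter idea of adjusting the overall feature distribution."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalized path and adapter path read the same input (parallel wiring).
        return self.norm(x) + self.up(torch.relu(self.down(x)))


class PETLBlock(nn.Module):
    """One encoder block combining prefix tokens, LoRA on the query path,
    and an L-Adapter after the attention residual."""

    def __init__(self, dim: int = 512, heads: int = 8, prefix_len: int = 8):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_lora = LoRALinear(dim)
        self.l_adapter = LAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        p = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([p, x], dim=1)   # prefix tokens join keys/values only
        attn_out, _ = self.attn(self.q_lora(x), kv, kv)
        return self.l_adapter(x + attn_out)
```

Zero-initializing the LoRA up-projection means the block reproduces the frozen pre-trained behavior at the start of training, and only the prefix, LoRA, and adapter parameters receive gradients, which is how a small trainable-parameter budget like the paper's 4.7% arises.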
Related papers
- CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval [22.01591564940522]
We introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability.
In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios.
Our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, but also showcases robustness and scalability.
arXiv Detail & Related papers (2025-04-26T03:26:30Z) - Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model.
It introduces a pre-trained CLIP backbone to learn basic multi-modal features.
It constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z) - Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP [24.22470408549266]
We dub the resulting prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE).
AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks.
We also show AAPE is particularly helpful for handling non-canonical and OOD examples.
arXiv Detail & Related papers (2024-10-31T07:41:13Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage textual features to dynamically adjust the classifier's weights, exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the full fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on downstream person-centric tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z) - CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning. CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features (sketched below). Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
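As a concrete reading of the summary above, here is a minimal sketch of a bottleneck adapter with residual-style feature blending on top of frozen CLIP features. The bottleneck width and the blend ratio are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn


class CLIPAdapter(nn.Module):
    """Trainable bottleneck on frozen CLIP features with residual blending."""

    def __init__(self, dim: int = 512, bottleneck: int = 128, ratio: float = 0.2):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # fraction of the adapted feature to mix in

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: frozen CLIP image or text embedding, shape (batch, dim)
        adapted = self.bottleneck(feat)
        # Residual-style blending keeps most of the pre-trained feature intact.
        return self.ratio * adapted + (1.0 - self.ratio) * feat
```

Keeping most of the residual weight on the original feature is what lets such an adapter be trained on small downstream sets without washing out the pre-trained representation.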
- Adaptive Prototypical Networks with Label Words and Joint Representation Learning for Few-Shot Relation Classification [17.237331828747006]
This work focuses on few-shot relation classification (FSRC).
We propose an adaptive mixture mechanism to add label words to the representation of the class prototype (a sketch of one possible reading follows this entry).
Experiments have been conducted on FewRel under different few-shot (FS) settings.
arXiv Detail & Related papers (2021-01-10T11:25:42Z)
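The "adaptive mixture" above is only named, not specified. One plausible reading, sketched under explicit assumptions (the gating network and all names here are hypothetical, and the paper's exact mechanism may differ), blends the mean support embedding with the label-word embedding using a learned weight:

```python
import torch
import torch.nn as nn


class AdaptivePrototype(nn.Module):
    """Blends the mean support embedding with a label-word embedding
    using a learned, input-dependent weight."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, support: torch.Tensor, label_word: torch.Tensor) -> torch.Tensor:
        # support: (k_shot, dim) instance embeddings; label_word: (dim,)
        mean_support = support.mean(dim=0)
        lam = self.gate(torch.cat([mean_support, label_word]))  # scalar in (0, 1)
        return lam * mean_support + (1.0 - lam) * label_word    # class prototype
```

Queries would then be classified by their distance to these blended prototypes, as in standard prototypical networks.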