CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
- URL: http://arxiv.org/abs/2504.18782v1
- Date: Sat, 26 Apr 2025 03:26:30 GMT
- Title: CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
- Authors: Hang Yu, Jiahao Wen, Zhedong Zheng,
- Abstract summary: We introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability.<n>In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios.<n>Our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, but also showcases robustness and scalability.
- Score: 22.01591564940522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history for errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.
Related papers
- Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN)<n>Benefiting from our rewriting mechanism, new observation-instruction can be obtained in both simulator-free and labor-saving manners to promote generalization.<n> Experiments on both the discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z) - Attribute-Based Robotic Grasping with Data-Efficient Adaptation [19.683833436076313]
We present an end-to-end encoder-decoder network to learn attribute-based robotic grasping.<n>Our approach achieves over 81% instance grasping success rate on unknown objects.
arXiv Detail & Related papers (2025-01-04T00:37:17Z) - Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a parameter-efficient Multi-modalpatio Ssupervised-Temporal Adapter (MSTA) to enhance alignment between textual and visual representations.<n>We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-Temporal learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adrial robustness has been conventionally believed as a challenging property to encode for neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z) - Gradient-Regulated Meta-Prompt Learning for Generalizable
Vision-Language Models [137.74524357614285]
We introduce a novel Gradient-RegulAted Meta-prompt learning framework.
It helps pre-training models adapt to downstream tasks in a parameter -- and data -- efficient way.
GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way.
arXiv Detail & Related papers (2023-03-12T05:03:37Z) - Improving Meta-learning for Low-resource Text Classification and
Generation via Memory Imitation [87.98063273826702]
We propose a memory imitation meta-learning (MemIML) method that enhances the model's reliance on support sets for task adaptation.
A theoretical analysis is provided to prove the effectiveness of our method.
arXiv Detail & Related papers (2022-03-22T12:41:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.