FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
- URL: http://arxiv.org/abs/2509.16674v2
- Date: Tue, 04 Nov 2025 06:23:32 GMT
- Title: FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
- Authors: Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li
- Abstract summary: Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. We propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability.
- Score: 13.089848592467675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions, thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
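Since the abstract stays at the framework level, the sketch below illustrates only the generic zero-shot retrieve-by-description step that TPR methods build on: encode the text query and a gallery of pedestrian crops with a pre-trained vision-language model, then rank by cosine similarity. It assumes an off-the-shelf CLIP checkpoint from Hugging Face transformers and hypothetical gallery paths; it does not reproduce FitPro's FCD, ISM, or QHR components.

```python
# Minimal sketch of zero-shot text-based pedestrian retrieval, assuming an
# off-the-shelf CLIP checkpoint. This is NOT the FitPro pipeline: FCD, ISM,
# and QHR are not reproduced; only rank-gallery-by-description is shown.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, not specified by the paper
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def rank_gallery(query_text, image_paths):
    """Rank gallery images by cosine similarity to a pedestrian description."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query_text], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # text_embeds / image_embeds are already L2-normalized, so the dot
    # product is cosine similarity.
    sims = (out.image_embeds @ out.text_embeds.T).squeeze(1)
    order = sims.argsort(descending=True)
    return [(image_paths[i], sims[i].item()) for i in order.tolist()]


# Hypothetical usage with a TPR-style description and gallery crops:
# ranking = rank_gallery("a woman in a red jacket carrying a black backpack",
#                        ["gallery/0001.jpg", "gallery/0002.jpg"])
```

An interactive, multi-turn variant would re-rank this list as the description is refined across turns, rather than issuing a single one-shot query.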
Related papers
- Chat-Driven Text Generation and Interaction for Person Retrieval [16.448356660477682]
We introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions.
arXiv Detail & Related papers (2025-09-16T04:40:24Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- Text-guided Visual Prompt DINO for Generic Segmentation [31.33676182634522]
We propose Prompt-DINO, a text-guided visual Prompt DINO framework. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features. Second, we design order-aligned query selection for DETR-based architectures. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model.
arXiv Detail & Related papers (2025-08-08T09:09:30Z)
- Multimodal RAG Enhanced Visual Description [3.2771631221674333]
Pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations. We propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to bridge the modality gap. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
arXiv Detail & Related papers (2025-08-06T19:04:38Z)
- Multimodal Prompt Alignment for Facial Expression Recognition [24.470095812039286]
MPA-FER provides fine-grained semantic guidance to the learning process of prompted visual features. Our framework outperforms state-of-the-art methods on three FER benchmark datasets.
arXiv Detail & Related papers (2025-06-26T05:28:57Z)
- DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a progressive cross-modal reasoning decoder (PCMRD).
arXiv Detail & Related papers (2025-06-23T02:38:56Z)
- SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)