Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification
- URL: http://arxiv.org/abs/2601.11243v1
- Date: Fri, 16 Jan 2026 12:45:01 GMT
- Title: Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification
- Authors: Zhiqi Pang, Lingling Zhao, Yang Liu, Chunyu Wang, Gaurav Sharma
- Abstract summary: Unsupervised multi-scenario (UMS) person re-identification (ReID) is a new task that expands ReID across diverse scenarios. We introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models.
- Score: 23.27053860130735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
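The abstract describes the framework only at a high level. As a minimal PyTorch sketch of two of its ingredients, assuming a CLIP-style image encoder that returns pooled features, one could add a learned scenario embedding to the image features (Stage I) and push per-scenario text prototypes apart with a margin-based separation loss (Stage II). The module and function names, the margin, and the prototype formulation are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): a scenario embedding injected into
# CLIP image features (Stage I) and a simple multi-scenario separation loss that pushes
# per-scenario text prototypes apart (Stage II). Names and the margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScenarioConditionedImageEncoder(nn.Module):
    def __init__(self, clip_image_encoder: nn.Module, num_scenarios: int, dim: int = 512):
        super().__init__()
        self.backbone = clip_image_encoder                       # pre-trained CLIP image encoder
        self.scenario_embed = nn.Embedding(num_scenarios, dim)   # one learned vector per scenario

    def forward(self, images: torch.Tensor, scenario_ids: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                            # (B, dim) pooled image features
        feats = feats + self.scenario_embed(scenario_ids)        # inject scenario-specific knowledge
        return F.normalize(feats, dim=-1)

def multi_scenario_separation_loss(text_feats: torch.Tensor,
                                   scenario_ids: torch.Tensor,
                                   margin: float = 0.3) -> torch.Tensor:
    """Increase divergence between text representations of different scenarios.

    text_feats:   (N, dim) learned text embeddings (e.g., one per pseudo-identity)
    scenario_ids: (N,)     scenario index of each embedding
    """
    text_feats = F.normalize(text_feats, dim=-1)
    scenarios = scenario_ids.unique()
    if scenarios.numel() < 2:                                    # nothing to separate
        return text_feats.new_zeros(())
    protos = torch.stack([text_feats[scenario_ids == s].mean(dim=0) for s in scenarios])
    protos = F.normalize(protos, dim=-1)                         # (S, dim) scenario prototypes
    sim = protos @ protos.t()                                    # pairwise cosine similarities
    off_diag = sim[~torch.eye(len(protos), dtype=torch.bool, device=sim.device)]
    return F.relu(off_diag - margin).mean()                      # penalize overly similar scenarios
```

In the full framework these pieces would be combined with pseudo-label clustering, the heterogeneous matching modules, and the dynamic text-update strategy described in the abstract, none of which is sketched here.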
Related papers
- Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation [22.845591588026366]
We propose a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module.
arXiv Detail & Related papers (2026-02-03T12:13:29Z)
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments; a minimal sketch of such an objective follows this entry.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
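The snippet-level objective above is described only at a high level; the following PyTorch sketch shows one plausible reading, assuming a standard symmetric InfoNCE formulation in which consecutive rendered snippets form positive pairs and other in-batch snippets serve as negatives. The `encoder` callable, batch layout, and temperature are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (assumptions, not the VC2L code): a symmetric InfoNCE objective that
# treats consecutive rendered snippets from the same document as positive pairs and all
# other in-batch snippets as negatives. `encoder` stands in for the shared vision transformer.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(encoder, snippet_a, snippet_b, temperature: float = 0.07):
    """snippet_a[i] and snippet_b[i] are consecutive segments of one document, rendered as images."""
    za = F.normalize(encoder(snippet_a), dim=-1)   # (B, dim) embeddings of the first snippets
    zb = F.normalize(encoder(snippet_b), dim=-1)   # (B, dim) embeddings of their successors
    logits = za @ zb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    # each snippet should retrieve its consecutive neighbour among all snippets in the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```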
- IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation [23.61167100602915]
IUT-Plug is a module grounded in an Image Understanding Tree (IUT). A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.
arXiv Detail & Related papers (2025-10-13T03:19:45Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation [35.50570174431677]
We propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Supported by novel loss functions, our model captures a broader range of information, enriches feature representations, improves discriminative ability, and enhances generalization across different resolutions.
arXiv Detail & Related papers (2025-04-26T08:44:04Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder; a small illustrative sketch follows this entry.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
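The summary above only names the mechanism; as a rough sketch (an assumed formulation, not the paper's implementation), word-conditional attention can be turned into a soft mask that reweights the region features fed to the ITM head. The tensor shapes, the temperature, and the choice to suppress the most text-relevant regions are all illustrative assumptions.

```python
# Rough sketch (assumed formulation, not the paper's code): word-conditional visual attention
# converted into a soft mask over image regions, which reweights the region features used
# for image-text matching (ITM). Shapes and the temperature are illustrative.
import torch
import torch.nn.functional as F

def text_driven_soft_mask(region_feats: torch.Tensor,
                          word_feats: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """
    region_feats: (B, R, dim) visual region/patch features
    word_feats:   (B, W, dim) contextual word embeddings
    returns soft-masked region features of shape (B, R, dim)
    """
    attn = torch.einsum('bwd,brd->bwr', word_feats, region_feats)  # word-to-region affinity
    attn = F.softmax(attn / temperature, dim=-1)                   # per-word attention over regions
    relevance = attn.max(dim=1).values                             # (B, R): strongest word attending to each region
    soft_mask = 1.0 - relevance                                    # suppress the most text-relevant regions
    return region_feats * soft_mask.unsqueeze(-1)
```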
- DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation [71.87682778102236]
We propose a Dynamic Semantic Evolution GAN (DSE-GAN) that re-composes each stage's text features within a novel single adversarial multi-stage architecture.
DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks.
arXiv Detail & Related papers (2022-09-03T06:13:26Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS relies on vision-language decoding and contrastive learning to achieve text-to-pixel alignment; a small illustrative sketch follows this list.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
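As referenced in the CRIS entry above, text-to-pixel alignment can be illustrated with a minimal PyTorch sketch: a sentence embedding is compared with per-pixel features, and pixels inside the ground-truth referring mask are pulled toward the text while the rest are pushed away. This is an assumed reading for illustration, not the CRIS implementation; the feature shapes, temperature, and BCE formulation are assumptions rather than details from the paper.

```python
# Illustrative sketch (assumptions, not the CRIS implementation): text-to-pixel alignment
# as a dot product between a sentence embedding and per-pixel features, trained with a
# binary cross-entropy objective against the ground-truth referring mask.
import torch
import torch.nn.functional as F

def text_to_pixel_alignment_loss(pixel_feats: torch.Tensor,
                                 text_feat: torch.Tensor,
                                 gt_mask: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """
    pixel_feats: (B, dim, H, W) per-pixel visual features from the decoder
    text_feat:   (B, dim)       sentence-level text embedding
    gt_mask:     (B, H, W)      binary mask of the referred person/object
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = torch.einsum('bdhw,bd->bhw', pixel_feats, text_feat) / temperature
    # pixels inside the mask are pulled toward the text, all other pixels are pushed away
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```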
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.