Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching
- URL: http://arxiv.org/abs/2507.06744v1
- Date: Wed, 09 Jul 2025 10:59:13 GMT
- Title: Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching
- Authors: Yafei Zhang, Yongle Shang, Huafeng Li
- Abstract summary: Weakly supervised text-to-person image matching is a crucial approach to reducing models' reliance on large-scale manually labeled samples. We propose a dual-granularity identity association mechanism to predict complex one-to-many identity relationships. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy.
- Score: 7.1469465755934785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised text-to-person image matching, as a crucial approach to reducing models' reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model's ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching.
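The abstract describes its mechanisms only in prose, so a small illustration may help. The following is a minimal PyTorch-style sketch of one plausible reading of the batch-level association step: each text queries all images in the batch, any image whose match confidence is close to the strongest one is kept (yielding the one-to-many relationships the abstract emphasizes), and low-confidence pairs are down-weighted. The function name `associate_identities`, the threshold `tau`, and the weighting rule are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def associate_identities(img_feats, txt_feats, tau=0.5):
    """Hypothetical batch-level one-to-many cross-modal identity association.

    img_feats, txt_feats: (B, D) image and text embeddings from one batch.
    Returns a (B, B) boolean match matrix (text i <-> image j) plus the
    per-pair confidences, usable as soft weights in a matching loss.
    Illustrative sketch only; not the paper's actual method.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    probs = (txt @ img.t()).softmax(dim=-1)          # row-wise match distribution
    anchor = probs.max(dim=-1, keepdim=True).values  # strongest match per text
    # Keep every image whose confidence is close to the strongest match,
    # allowing one text to associate with several same-identity images.
    matches = probs >= tau * anchor
    return matches, probs * matches
```

The returned soft weights could then gate a contrastive or identification loss so that weakly associated pairs contribute less, in the spirit of the confidence-based dynamic adjustment the abstract mentions.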
Related papers
- Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning [13.578738226091911]
This paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels. We propose a heterogeneous expert collaborative consistency learning framework designed to establish robust cross-modal identity correspondences. Experimental results on two challenging datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-07-17T09:31:34Z) - Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching [6.633576185707164]
This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on the Flickr30K, MSCOCO, and ECCV Caption datasets.
arXiv Detail & Related papers (2025-07-12T11:30:32Z) - Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design contrastive learning frameworks for global feature learning. We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up objectives for the specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z) - Mix-Modality Person Re-Identification: A New and Practical Paradigm [20.01921944345468]
Existing visible-infrared person re-identification (VI-ReID) methods have achieved some results in the bi-modality mutual retrieval paradigm. We propose a new and more practical mix-modality retrieval paradigm: a Mix-Modality person re-identification (MM-ReID) task that explores the influence of the modality mixing ratio on performance and constructs mix-modality test sets for existing datasets.
arXiv Detail & Related papers (2024-12-06T02:19:57Z) - Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning [5.0823084858349485]
Image-level weak-to-strong consistency serves as the predominant paradigm in semi-supervised learning (SSL).
We introduce feature-level perturbation with varying intensities and forms to expand the augmentation space.
We present a confidence-based identification strategy to distinguish between naive and challenging samples (a minimal sketch of the underlying weak-to-strong scheme appears after this list).
arXiv Detail & Related papers (2024-08-08T13:19:25Z) - Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning within and across modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z) - Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID [56.573905143954015]
We propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z) - Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Effective video representations hinge on cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of the feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps, together with multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z) - Unsupervised Person Re-identification via Simultaneous Clustering and Consistency Learning [22.008371113710137]
We design a pretext task for unsupervised re-ID by learning visual consistency from still images and temporal consistency during the training process.
We optimize the model by grouping the two encoded views into the same cluster, thus enhancing the visual consistency between views.
arXiv Detail & Related papers (2021-04-01T02:10:42Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
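As referenced in the semi-supervised entry above, the weak-to-strong consistency paradigm is concrete enough to sketch. Below is a minimal FixMatch-style illustration of the standard image-level scheme that the cited paper builds on (its own contribution, feature-level perturbation, is not shown); the name `weak_to_strong_loss` and the threshold `conf_thresh` are assumptions for this sketch, not that paper's API.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(logits_weak, logits_strong, conf_thresh=0.95):
    """FixMatch-style weak-to-strong consistency loss (illustrative sketch).

    logits_weak / logits_strong: (B, C) predictions for weakly and strongly
    augmented views of the same unlabeled images. Pseudo-labels come from
    the weak view; only confident ones supervise the strong view.
    """
    with torch.no_grad():
        probs = logits_weak.softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= conf_thresh).float()  # confidence-based sample selection
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```

A confidence-based identification strategy like the one the entry describes would refine the `mask` step, treating samples the weak view already classifies confidently as naive and routing the rest through stronger perturbations.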