DualFocus: A Unified Framework for Integrating Positive and Negative Descriptors in Text-based Person Retrieval
- URL: http://arxiv.org/abs/2405.07459v1
- Date: Mon, 13 May 2024 04:21:00 GMT
- Title: DualFocus: A Unified Framework for Integrating Positive and Negative Descriptors in Text-based Person Retrieval
- Authors: Yuchuan Deng, Zhanpeng Hu, Jiakun Han, Chuang Deng, Qijun Zhao
- Abstract summary: We introduce DualFocus, a framework for integrating positive and negative descriptors.
By focusing on token-level comparisons, DualFocus significantly outperforms existing techniques in both precision and robustness.
Experimental results highlight DualFocus's superior performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid.
- Score: 6.381155145404096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based person retrieval (TPR) aims to retrieve images of a person from an extensive array of candidates based on a given textual description. The core challenge lies in mapping visual and textual data into a unified latent space. While existing TPR methods concentrate on recognizing explicit and positive characteristics, they often neglect the critical influence of negative descriptors, resulting in potential false positives that fulfill the positive criteria but could be excluded by the negative descriptors. To alleviate these issues, we introduce DualFocus, a unified framework for integrating positive and negative descriptors to enhance the interpretative accuracy of vision-language foundational models regarding textual queries. DualFocus employs Dual (Positive/Negative) Attribute Prompt Learning (DAPL), which integrates Dual Image-Attribute Contrastive (DIAC) Learning and Sensitive Image-Attributes Matching (SIAM) Learning. In this way, DualFocus enhances the detection of unseen attributes, thereby boosting retrieval precision. To further balance coarse- and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss, which refines the representation of both matching and non-matching descriptions, thereby enhancing the matching process through a detailed and adaptable similarity assessment. By focusing on token-level comparisons, DualFocus significantly outperforms existing techniques in both precision and robustness. Experimental results highlight DualFocus's superior performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid.
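For intuition, below is a minimal, hedged sketch of the kind of token-level image-text scoring that a tokenwise similarity loss like DTS builds on: each text token is compared against every image patch and the best matches are averaged in both directions (a FILIP-style max-mean scheme). The paper's actual DTS formulation, including its dynamic weighting and its treatment of negative descriptors, is not reproduced here; the function name, tensor shapes, and PyTorch usage are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def tokenwise_similarity(img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """Token-level cosine similarity between image patches and text tokens.

    img_tokens: (B, Nv, D) patch embeddings; txt_tokens: (B, Nt, D) word embeddings.
    Returns a (B,) similarity score for each paired image and caption.
    """
    img_tokens = F.normalize(img_tokens, dim=-1)
    txt_tokens = F.normalize(txt_tokens, dim=-1)
    # (B, Nv, Nt): cosine similarity of every patch with every word token.
    sim = torch.einsum("bvd,btd->bvt", img_tokens, txt_tokens)
    # Each word token takes its best-matching patch, averaged over words ...
    txt_to_img = sim.max(dim=1).values.mean(dim=-1)
    # ... and symmetrically, each patch takes its best-matching word.
    img_to_txt = sim.max(dim=2).values.mean(dim=-1)
    return 0.5 * (txt_to_img + img_to_txt)

# Toy usage: 4 image-text pairs, 196 patches, 32 text tokens, 512-dim embeddings.
scores = tokenwise_similarity(torch.randn(4, 196, 512), torch.randn(4, 32, 512))
print(scores.shape)  # torch.Size([4])
```

In a full retrieval objective, pairwise scores of this kind would typically feed a batch-wise contrastive loss, with negatively described attributes lowering the scores of otherwise-matching candidates.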
Related papers
- Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval [23.472806734625774]
We propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR) to achieve precise image-text matching.
Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning.
arXiv Detail & Related papers (2025-08-06T02:44:08Z)
- OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly.
CIR remains in its nascent stages, in part because the inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation.
This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z)
- Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning [69.33115351856785]
We present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT.
The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions.
Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL can boost recognition performance by 3.47% on average.
arXiv Detail & Related papers (2025-06-12T11:09:49Z)
- Descriptive Image-Text Matching with Graded Contextual Similarity [41.10869519062159]
We propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text.
We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity.
Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways.
arXiv Detail & Related papers (2025-05-15T06:21:00Z)
- CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching [31.42896369011162]
CoMatch is a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy.
A covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores.
A fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level.
arXiv Detail & Related papers (2025-03-31T10:17:01Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search [51.09723403468361]
We propose a Relation and Sensitivity aware representation learning method (RaSa).
RaSa includes two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA).
Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45%, and 15.35% in Rank@1 on the respective benchmark datasets.
arXiv Detail & Related papers (2023-05-23T03:53:57Z)
- Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding [12.509298933267221]
We introduce a two-stage Contrastive Learning with Text-Embedded framework for facial behavior understanding.
The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using coarse-grained activity information.
The second stage aims to train the recognition of facial expressions or facial action units by maximizing the similarity between the image and the corresponding text label names.
arXiv Detail & Related papers (2023-03-31T18:21:09Z)
- Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL consists of a two-stream architecture by extracting two types of global features from RGB and noise views respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
arXiv Detail & Related papers (2022-10-16T13:30:13Z)
- TaCo: Textual Attribute Recognition via Contrastive Learning [9.042957048594825]
TaCo is a contrastive framework for textual attribute recognition tailored toward the most common document scenes.
We design the learning paradigm from three perspectives: 1) generating attribute views, 2) extracting subtle but crucial details, and 3) exploiting valued view pairs for learning.
Experiments show that TaCo surpasses the supervised counterparts and advances the state-of-the-art remarkably on multiple attribute recognition tasks.
arXiv Detail & Related papers (2022-08-22T09:45:34Z)
- Pose-guided Visible Part Matching for Occluded Person ReID [80.81748252960843]
We propose a Pose-guided Visible Part Matching (PVPM) method that jointly learns the discriminative features with pose-guided attention and self-mines the part visibility.
Experimental results on three occluded-person ReID benchmarks show that the proposed method achieves performance competitive with state-of-the-art methods.
arXiv Detail & Related papers (2020-04-01T04:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.