SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
- URL: http://arxiv.org/abs/2304.02278v3
- Date: Thu, 17 Oct 2024 08:57:50 GMT
- Title: SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
- Authors: Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Victor C. M. Leung
- Abstract summary: Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space.
We present a method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
- Score: 43.17325362167387
- Abstract: Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries given only a textual caption. For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space to reduce the inter-modal gap. Furthermore, learning detailed image-text correspondences is essential to discriminate similar targets and enable fine-grained search. To address these challenges, we present a simple yet effective method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings. SCMM is distinguished by two novel losses that provide fine-grained cross-modal representations: 1) a Sew calibration loss that takes the quality of textual captions as guidance and aligns features between the image and text modalities, and 2) a Masked Caption Modeling (MCM) loss that leverages a masked caption prediction task to establish detailed and generic relationships between textual and visual parts. This dual-pronged strategy refines feature alignment and enriches cross-modal correspondences, enabling the accurate distinction of similar individuals. Moreover, the streamlined dual-encoder architecture avoids complex branches and interactions, facilitating the high-speed inference required by resource-limited applications that often demand real-time processing. Extensive experiments on three popular TBPS benchmarks demonstrate the superiority of SCMM, achieving top results with 73.81%, 74.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES, ICFG-PEDES, and RSTPReID, respectively. We hope SCMM's scalable and cost-effective design will serve as a strong baseline and facilitate future research in this field.
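The abstract names the two losses but gives no formulas, so the PyTorch-style sketch below is only one plausible reading, not the paper's implementation: the Sew calibration term is approximated as a caption-quality-weighted margin loss over a dual-encoder similarity matrix, and the MCM term as masked-token cross-entropy conditioned on image features through cross-attention. The class names, the quality-scaled margin, and the -100 masking convention are all assumptions.

```python
# Illustrative sketch only; the quality weighting, margin, and masking scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SewCalibrationLoss(nn.Module):
    """Hypothetical reading of the Sew calibration idea: image/text embeddings are
    aligned with a hinge loss whose margin is scaled by a per-caption quality score,
    so higher-quality captions are pulled closer to their images."""

    def __init__(self, base_margin: float = 0.2):
        super().__init__()
        self.base_margin = base_margin

    def forward(self, img_emb, txt_emb, quality):
        # img_emb, txt_emb: (B, D); quality: (B,) scores in [0, 1]
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        sim = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
        pos = sim.diag().unsqueeze(1)                     # matched image-text pairs
        margin = self.base_margin * quality.unsqueeze(1)  # quality-scaled margin
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg = sim.masked_fill(mask, float("-inf"))        # exclude positive pairs
        loss_i2t = F.relu(margin + neg.max(dim=1, keepdim=True).values - pos)
        loss_t2i = F.relu(margin + neg.max(dim=0, keepdim=True).values.t() - pos)
        return (loss_i2t + loss_t2i).mean()


class MaskedCaptionModelingLoss(nn.Module):
    """Hypothetical MCM head: masked caption tokens are predicted from the text
    context fused with image tokens via a single cross-attention layer."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, txt_tokens, img_tokens, labels):
        # txt_tokens: (B, Lt, D) with masked positions; img_tokens: (B, Lv, D)
        # labels: (B, Lt) token ids, -100 at unmasked positions (ignored by CE)
        fused, _ = self.cross_attn(txt_tokens, img_tokens, img_tokens)
        logits = self.head(fused)                         # (B, Lt, vocab_size)
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
```

In such a setup the two terms would simply be combined, e.g. `loss = sew_loss + lambda_mcm * mcm_loss`, with the dual encoders supplying both the pooled embeddings and the token-level features; the actual weighting and architectural details are those of the paper, not of this sketch.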
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [2.389598109913754]
We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT.
To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction.
Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z)
- Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing [1.6758573326215689]
We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval.
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods.
Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv Detail & Related papers (2022-01-20T12:05:10Z)
- Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings.
The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching.
Our approach achieves a rank-1 score of 55.81%, outperforming the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
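A side note on the inference-speed and Rank-1 claims that recur above (SCMM, COTS, DCMG): with two independent encoders, gallery image embeddings can be precomputed once, so serving a text query reduces to a single matrix product plus a ranking step. The NumPy sketch below, with purely illustrative names and sizes not taken from any of the papers, shows that retrieval path together with the Rank-1 accuracy metric the benchmarks report.

```python
# Minimal dual-encoder retrieval sketch; identities and embeddings are synthetic.
import numpy as np


def rank1_accuracy(text_emb: np.ndarray, gallery_emb: np.ndarray,
                   text_ids: np.ndarray, gallery_ids: np.ndarray) -> float:
    """Fraction of text queries whose top-ranked gallery image shares their identity."""
    # L2-normalize so the dot product equals cosine similarity
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    gallery_emb = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = text_emb @ gallery_emb.T          # (num_queries, num_gallery)
    top1 = sim.argmax(axis=1)               # best-matching gallery index per query
    return float((gallery_ids[top1] == text_ids).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, g, d = 8, 100, 256                   # toy sizes, purely illustrative
    print(rank1_accuracy(rng.normal(size=(q, d)), rng.normal(size=(g, d)),
                         rng.integers(0, 10, size=q), rng.integers(0, 10, size=g)))
```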
This list is automatically generated from the titles and abstracts of the papers on this site.