SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
- URL: http://arxiv.org/abs/2304.02278v5
- Date: Fri, 06 Dec 2024 10:13:10 GMT
- Title: SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
- Authors: Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Victor C. M. Leung
- Abstract summary: Text-Based Person Search (TBPS) is a crucial task in the Internet of Things (IoT) domain.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space.
We present Sew Calibration and Masked Modeling (SCMM), which calibrates cross-modal representations by learning compact and well-aligned embeddings.
- Score: 43.17325362167387
- Abstract: Text-Based Person Search (TBPS) is a crucial task in the Internet of Things (IoT) domain that enables accurate retrieval of target individuals from large-scale galleries given only a textual caption. For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space to reduce the inter-modal gap. Furthermore, learning detailed image-text correspondences is essential for discriminating similar targets and enabling fine-grained search. To address these challenges, we present a simple yet effective method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings. SCMM introduces two novel losses for fine-grained cross-modal representations: a Sew calibration loss that aligns image and text features based on textual caption quality, and a Masked Caption Modeling (MCM) loss that establishes detailed relationships between textual and visual parts. This dual-pronged strategy enhances feature alignment and cross-modal correspondences, enabling accurate distinction of similar individuals while maintaining a streamlined dual-encoder architecture for real-time inference, which is essential for resource-limited sensors and IoT systems. Extensive experiments on three popular TBPS benchmarks demonstrate the superiority of SCMM, achieving 73.81%, 64.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES, ICFG-PEDES, and RSTPReID, respectively.
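The abstract names the two losses but does not give their formulations. Purely as an illustration, the PyTorch sketch below shows one plausible shape for each: a caption-quality-weighted margin loss over the shared dual-encoder space, and a masked-caption head that predicts masked word tokens by attending to image patches. All names and formulas here (the `quality` score, the margin weighting, the attention head) are assumptions made for the sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sew_calibration_loss(img_emb, txt_emb, quality, base_margin=0.2):
    """Hypothetical quality-weighted alignment loss (a sketch, not the
    paper's formula). img_emb, txt_emb: (B, D) L2-normalized embeddings
    from the dual encoder; quality: (B,) caption-quality scores in [0, 1]."""
    sims = img_emb @ txt_emb.t()                     # (B, B) cosine similarities
    pos = sims.diag()                                # matched image-text pairs
    # Hardest in-batch negative per image (diagonal pushed out of the running).
    neg = (sims - 2 * torch.eye(sims.size(0), device=sims.device)).max(dim=1).values
    margin = base_margin * quality                   # stricter for good captions
    return F.relu(margin + neg - pos).mean()

class MaskedCaptionHead(torch.nn.Module):
    """Hypothetical MCM head: masked word features attend to image patches,
    then predict the original token ids (cross-entropy on masked positions)."""
    def __init__(self, dim=512, vocab_size=30522):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = torch.nn.Linear(dim, vocab_size)

    def forward(self, word_feats, patch_feats, labels):
        # word_feats: (B, L, D) with [MASK]ed tokens; patch_feats: (B, N, D).
        fused, _ = self.attn(word_feats, patch_feats, patch_feats)
        logits = self.classifier(fused)              # (B, L, vocab_size)
        # labels: token id at masked positions, -100 elsewhere (ignored).
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                               ignore_index=-100)
```

In this reading, a higher-quality caption enforces a stricter margin, "sewing" its embedding closer to the paired image, while the MCM head forces word-level features to consult visual parts; both losses act only during training, so the dual-encoder inference path stays untouched.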
Related papers
- Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM).
AGA achieves new state-of-the-art results, with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
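The AGA summary above names Attention-Guided Mask (AGM) Modeling without detailing it. One natural reading is that the mask is placed on the most-attended caption words rather than at random, so the prediction task targets informative tokens; the sketch below illustrates that reading only, with `attn_weights`, the masking ratio, and the source of attention all being assumptions rather than the paper's specification.

```python
import torch

def attention_guided_mask(token_ids, attn_weights, mask_id, ratio=0.15):
    """Mask the most-attended caption tokens (an illustrative guess at
    attention-guided masking, not the AGA authors' code).

    token_ids:    (B, L) caption token ids
    attn_weights: (B, L) e.g. [CLS]-to-token attention from the text encoder
    mask_id:      id of the [MASK] token in the vocabulary
    """
    k = max(1, int(ratio * token_ids.size(1)))
    top = attn_weights.topk(k, dim=1).indices        # most-attended positions
    masked = token_ids.clone()
    masked.scatter_(1, top, mask_id)                 # replace them with [MASK]
    return masked
```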
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including adaptation from images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [2.389598109913754]
We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology for aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z)
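The entry above describes aligning additional RS imagery modalities with CLIP's visual and textual spaces, but not the objective used. The standard building block for such alignment is a symmetric contrastive (InfoNCE) loss against frozen CLIP embeddings; the sketch below shows that generic form under the assumption of paired samples per batch, and should not be read as the paper's exact method.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(new_mod_emb, clip_emb, temperature=0.07):
    """Generic symmetric contrastive alignment against a frozen CLIP space
    (a common recipe, assumed here rather than taken from the paper).

    new_mod_emb: (B, D) embeddings of e.g. SAR or multispectral patches
    clip_emb:    (B, D) frozen CLIP embeddings of the paired RGB image/text
    """
    a = F.normalize(new_mod_emb, dim=-1)
    b = F.normalize(clip_emb, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +       # positives on the diagonal
            F.cross_entropy(logits.t(), targets)) / 2
```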
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods, and performance comparable to the latest single-stream methods while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
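The inference-speed gap COTS reports (and that SCMM's streamlined dual encoder also relies on) comes from an architectural fact: with separate streams, gallery embeddings are computed once offline and each query reduces to a single matrix multiply, whereas single-stream models must run cross-attention per query-gallery pair. A minimal sketch, with illustrative shapes assumed:

```python
import torch
import torch.nn.functional as F

# Gallery embeddings are produced once, offline, by the image encoder;
# the sizes below are illustrative assumptions.
gallery = F.normalize(torch.randn(100_000, 512), dim=-1)

def search(text_emb, k=10):
    """Rank the whole gallery for one caption embedding of shape (512,)."""
    q = F.normalize(text_emb, dim=0)
    scores = gallery @ q                             # (100_000,) cosine scores
    return scores.topk(k).indices                    # indices of top-k matches

print(search(torch.randn(512)))                      # ten best gallery ids
```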
- Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings.
The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching.
Our approach achieves a rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
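The DCMG entry above optimizes with cross-modal projection matching (CMPM), a loss introduced by Zhang and Lu (ECCV 2018) that matches the softmax distribution of image-to-text projections to the identity-label distribution via a KL divergence. The sketch below follows that published formulation in spirit; treat the details as a best-effort reconstruction rather than the DCMG authors' code.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(img_emb, txt_emb, labels, eps=1e-8):
    """Cross-Modal Projection Matching (after Zhang & Lu, 2018; details are
    a best-effort reconstruction). labels: (B,) identity labels; pairs that
    share an identity count as positives."""
    proj = img_emb @ F.normalize(txt_emb, dim=-1).t()    # project images onto texts
    pred = F.softmax(proj, dim=1)                        # predicted match distribution
    match = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    true = match / match.sum(dim=1, keepdim=True)        # label-based target
    # KL(pred || true), averaged over the batch.
    return (pred * (torch.log(pred + eps) - torch.log(true + eps))).sum(1).mean()
```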
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called the Transformer Encoder Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively.
arXiv Detail & Related papers (2020-08-12T11:02:40Z)
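TERAN's "fine-grained match between the underlying components of images and sentences" is typically realized as a pooled region-word similarity. Below is a minimal sketch of the common max-sum pooling, offered as an illustration of the idea rather than the paper's exact scoring function:

```python
import torch
import torch.nn.functional as F

def max_sum_alignment(regions, words):
    """Score one image-sentence pair from component similarities: keep each
    word's best-matching region and sum (a common pooling; the exact TERAN
    scoring is an assumption here).

    regions: (N, D) image region features;  words: (L, D) word features
    """
    sims = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()  # (N, L)
    return sims.max(dim=0).values.sum()              # best region per word
```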