Related papers: Calibrating Cross-modal Features for Text-Based Person Searching

Calibrating Cross-modal Features for Text-Based Person Searching

URL: http://arxiv.org/abs/2304.02278v2
Date: Thu, 1 Jun 2023 01:49:26 GMT
Title: Calibrating Cross-modal Features for Text-Based Person Searching
Authors: Donglai Wei, Sipeng Zhang, Tong Yang, Yang Liu, Jing Liu
Abstract summary: We present a simple yet effective method that calibrates cross-modal features from two perspectives. Our method consists of two novel losses to provide fine-grained cross-modal features. It achieves top results on three popular benchmarks with 73.81%, 74.25% and 57.35% Rank1 accuracy.
Score: 18.3145271655619
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-Based Person Searching (TBPS) aims to identify the images of pedestrian targets from a large-scale gallery with given textual caption. For cross-modal TBPS task, it is critical to obtain well-distributed representation in the common embedding space to reduce the inter-modal gap. Furthermore, it is also essential to learn detailed image-text correspondence efficiently to discriminate similar targets and enable fine-grained target search. To address these challenges, we present a simple yet effective method that calibrates cross-modal features from these two perspectives. Our method consists of two novel losses to provide fine-grained cross-modal features. The Sew calibration loss takes the quality of textual captions as guidance and aligns features between image and text modalities. On the other hand, the Masking Caption Modeling (MCM) loss leverages a masked captions prediction task to establish detailed and generic relationships between textual and visual parts. The proposed method is cost-effective and can easily retrieve specific persons with textual captions. The architecture has only a dual-encoder without multi-level branches or extra interaction modules, making a high-speed inference. Our method achieves top results on three popular benchmarks with 73.81%, 74.25% and 57.35% Rank1 accuracy on the CUHK-PEDES, ICFG-PEDES, and RSTPReID, respectively. We hope our scalable method will serve as a solid baseline and help ease future research in TBPS. The code will be publicly available.

Related papers

I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking [8.758773321492809]
We propose a novel framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections.<n>Our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively.
arXiv Detail & Related papers (2025-08-04T09:43:54Z)
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring [26.174094671736686]
We propose a novel quality-driven data selection pipeline for visual instruction tuning datasets.<n>It integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task.<n>It generates general and task-specific captions, and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry.
arXiv Detail & Related papers (2025-06-10T04:04:58Z)
BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation [55.486872677160015]
We reformulate multi-modal semantic segmentation as a mask-level classification task.<n>We propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA)<n> Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-06-04T08:04:58Z)
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval [15.503629941274621]
This study proposes a Representation Discrepancy Bridging (RDB) method for the Remote Image-Text Retrieval (RSITR) task.<n>Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics.
arXiv Detail & Related papers (2025-05-22T14:59:30Z)
Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM) AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z)
MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. In this paper, we propose a two-stage framework to tackle both discrepancies. MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task. MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD) It aims to detect salient objects from arbitrary modalities, eg RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) will be proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [2.389598109913754]
We focus on Contrastive Language-Image Pre-training (CLIP), an open-vocabulary foundation model, which achieves high accuracy across many image classification tasks. There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. We propose a methodology for the purpose of aligning distinct RS imagery modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z)
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Cross-BERT for Point Cloud Pretraining [61.762046503448936]
We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT. To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction. Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
arXiv Detail & Related papers (2023-12-08T08:18:12Z)
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD. Our method inherits the transferred knowledge of image-text semantic alignment from CLIP model to realize fine-grained patch-word feature alignment. Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework. It learns relations between local visual-textual tokens and enhances global image-text matching. The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images. We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval. Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800X faster in inference. Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-ofthe-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing [1.6758573326215689]
We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval. Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods. Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv Detail & Related papers (2022-01-20T12:05:10Z)
Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings. The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching. Our approach achieves the rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN) TERAN enforces a fine-grained match between the underlying components of images and sentences. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks.
arXiv Detail & Related papers (2020-08-12T11:02:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.