Related papers: Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

URL: http://arxiv.org/abs/2308.09911v3
Date: Thu, 28 Mar 2024 07:16:11 GMT
Title: Noisy-Correspondence Learning for Text-to-Image Person Re-identification
Authors: Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu,
Abstract summary: We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
Score: 50.07634676709067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs inevitably exist under-correlated or even false-correlated, a.k.a noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) relaxes the conventional Triplet Ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing the model collapse under NC and can also focus on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.

Related papers

From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval [30.33315985826623]
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. We propose a two-stage framework where the training is accomplished from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics.
arXiv Detail & Related papers (2025-04-25T00:18:23Z)
MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. In this paper, we propose a two-stage framework to tackle both discrepancies. MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
DualFocus: Integrating Plausible Descriptions in Text-based Person Re-identification [6.381155145404096]
We introduce DualFocus, a unified framework that integrates plausible descriptions to enhance the interpretative accuracy of vision-language models in Person Re-identification tasks. To achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss. The comprehensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid, DualFocus demonstrates superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2024-05-13T04:21:00Z)
Symmetrical Bidirectional Knowledge Alignment for Zero-Shot Sketch-Based Image Retrieval [69.46139774646308]
This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR) It aims to use sketches from unseen categories as queries to match the images of the same category. We propose a novel Symmetrical Bidirectional Knowledge Alignment for zero-shot sketch-based image retrieval (SBKA)
arXiv Detail & Related papers (2023-12-16T04:50:34Z)
Dynamic Weighted Combiner for Mixed-Modal Image Retrieval [8.683144453481328]
Mixed-Modal Image Retrieval (MMIR) as a flexible search paradigm has attracted wide attention. Previous approaches always achieve limited performance, due to two critical factors. We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges.
arXiv Detail & Related papers (2023-12-11T07:36:45Z)
Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Constrastive Learning (CODER) for improving cross-modal representation. We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting. Experiments conducted on two popular benchmarks, i.e. MSCOCO and Flicker30K, validate CODER remarkably outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z)
Optimized latent-code selection for explainable conditional text-to-image GANs [8.26410341981427]
We present a variety of techniques to take a deep look into the latent space and semantic space of the conditional text-to-image GANs model. We propose a framework for finding good latent codes by utilizing a linear SVM.
arXiv Detail & Related papers (2022-04-27T03:12:55Z)
Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework calledInter-class DiscrepancyAlignment(IDA) IDA-DAO is used to align the similarity scores considering the discrepancy between the images and its neighbors. IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
Devil's in the Details: Aligning Visual Clues for Conditional Embedding in Person Re-Identification [94.77172127405846]
We propose two key recognition patterns to better utilize the detail information of pedestrian images. CACE-Net achieves state-of-the-art performance on three public datasets.
arXiv Detail & Related papers (2020-09-11T06:28:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.