Related papers: Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval

URL: http://arxiv.org/abs/2306.02092v1
Date: Sat, 3 Jun 2023 11:50:44 GMT
Title: Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval
Authors: Xu Zhang, Zhedong Zheng, Xiaohan Wang, Yi Yang
Abstract summary: We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity. Css-Net can alleviate triplet ambiguity, achieving competitive performance on benchmarks, such as $+2.77\%$ R@10 and $+6.67\%$ R@50 on FashionIQ.
Score: 48.914550252133125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language-guided image retrieval enables users to search for images and interact with the retrieval system more naturally and expressively by using a reference image and a relative caption as a query. Most existing studies mainly focus on designing image-text composition architecture to extract discriminative visual-linguistic relations. Despite great success, we identify an inherent problem that obstructs the extraction of discriminative features and considerably compromises model training: triplet ambiguity. This problem stems from the annotation process wherein annotators view only one triplet at a time. As a result, they often describe simple attributes, such as color, while neglecting fine-grained details like location and style. This leads to multiple false-negative candidates matching the same modification text. We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity. Inspired by the psychological finding that groups perform better than individuals, Css-Net comprises 1) a consensus module featuring four distinct compositors that generate diverse fused image-text embeddings and 2) a Kullback-Leibler divergence loss, which fosters learning among the compositors, enabling them to reduce biases learned from noisy triplets and reach a consensus. The decisions from four compositors are weighted during evaluation to further achieve consensus. Comprehensive experiments on three datasets demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive performance on benchmarks, such as $+2.77\%$ R@10 and $+6.67\%$ R@50 on FashionIQ.
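Illustrative sketch (not from the paper): the abstract describes a Kullback-Leibler divergence loss that lets the compositors learn from one another, and a weighted combination of their decisions at evaluation time. The Python below is a minimal, hedged reading of those two ideas. The function names, the symmetric pairwise averaging, the uniform weights, and the toy scores are all assumptions made for illustration; the actual Css-Net formulation may differ.

```python
import math


def softmax(scores):
    """Turn one compositor's similarity scores into a distribution over candidates."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_kl_loss(score_lists):
    """Average KL divergence over all ordered pairs of compositors (assumed form)."""
    probs = [softmax(s) for s in score_lists]
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(probs[i], probs[j]) for i, j in pairs) / len(pairs)


def weighted_consensus(score_lists, weights):
    """Fuse per-compositor scores with a weighted sum (weights are hypothetical)."""
    num_candidates = len(score_lists[0])
    return [
        sum(w * s[k] for w, s in zip(weights, score_lists))
        for k in range(num_candidates)
    ]


# Toy example: four compositors scoring three candidate images.
scores = [
    [2.0, 1.0, 0.1],
    [1.8, 1.2, 0.0],
    [0.5, 2.2, 0.3],  # this compositor disagrees with the others
    [2.1, 0.9, 0.2],
]
weights = [0.25, 0.25, 0.25, 0.25]  # uniform weights, chosen only for illustration

loss = mutual_kl_loss(scores)
fused = weighted_consensus(scores, weights)
best = max(range(len(fused)), key=lambda k: fused[k])
print("mutual KL loss:", loss)
print("top-ranked candidate:", best)
```

In this toy setup the disagreeing compositor raises the mutual KL term, while the weighted fusion still ranks the candidate preferred by the majority first.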

Related papers

OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly. CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation. This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z)
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics [0.0]
Redemption Score is a novel framework that ranks image captions by triangulating three complementary signals. On the Flickr8k benchmark, Redemption Score achieves a Kendall-$\tau$ of 56.43, outperforming twelve prior methods.
arXiv Detail & Related papers (2025-05-22T03:35:12Z)
Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z)
Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
Ranking-aware Uncertainty for Text-guided Image Retrieval [17.70430913227593]
We propose a novel ranking-aware uncertainty approach to model many-to-many correspondences. Compared to the existing state-of-the-art methods, our proposed method achieves significant results on two public datasets.
arXiv Detail & Related papers (2023-08-16T03:48:19Z)
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor. PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes. Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
Image-Text Retrieval with Binary and Continuous Label Supervision [38.682970905704906]
This paper proposes an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS). For the learning of binary labels, we improve the common Triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence. For the learning of continuous labels, we design Kendall ranking loss inspired by Kendall rank correlation coefficient (Kendall) to improve the correlation between the similarity scores predicted by the retrieval model and the continuous labels.
arXiv Detail & Related papers (2022-10-20T14:52:34Z)
Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression. In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net). Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings. The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching. Our approach achieves the rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language. Most existing approaches only rely on the image-text instance pair to learn their representations. We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks. We propose four key components: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse different modality text and image features, and (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.