LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval
- URL: http://arxiv.org/abs/2210.04754v1
- Date: Mon, 10 Oct 2022 15:09:39 GMT
- Title: LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval
- Authors: Yan Gong and Georgina Cosma
- Abstract summary: Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for information retrieval.
Most existing VSE networks are trained by adopting a hard negatives loss function which learns an objective margin between the similarity of relevant and irrelevant image-description embedding pairs.
This paper presents a novel approach that comprises two main parts: (1) finds the underlying semantics of image descriptions; and (2) proposes a novel semantically enhanced hard negatives loss function.
- Score: 0.4264192013842096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Semantic Embedding (VSE) aims to extract the semantics of images and
their descriptions, and embed them into the same latent space for cross-modal
information retrieval. Most existing VSE networks are trained by adopting a
hard negatives loss function which learns an objective margin between the
similarity of relevant and irrelevant image-description embedding pairs.
However, the objective margin in the hard negatives loss function is set as a
fixed hyperparameter that ignores the semantic differences of the irrelevant
image-description pairs. To address the challenge of measuring the optimal
similarities between image-description pairs before obtaining the trained VSE
networks, this paper presents a novel approach that comprises two main parts:
(1) finds the underlying semantics of image descriptions; and (2) proposes a
novel semantically enhanced hard negatives loss function, where the learning
objective is dynamically determined based on the optimal similarity scores
between irrelevant image-description pairs. Extensive experiments were carried
out by integrating the proposed methods into five state-of-the-art VSE networks
that were applied to three benchmark datasets for cross-modal information
retrieval tasks. The results revealed that the proposed methods achieved the
best performance and can also be adopted by existing and future VSE networks.
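To make the dynamic-margin idea concrete, below is a minimal PyTorch-style sketch, not the authors' released code: it assumes the description semantics are summarised offline as a similarity matrix (e.g. cosine similarity of TF-IDF/LSA description vectors), and it uses an illustrative linear rule (base_margin and alpha are hypothetical hyperparameters) to shrink the margin for semantically similar negatives.

import torch

def dynamic_margin_triplet_loss(img_emb, txt_emb, desc_sim,
                                base_margin=0.2, alpha=0.1):
    # img_emb, txt_emb: (B, D) L2-normalised embeddings of matched pairs.
    # desc_sim: (B, B) precomputed semantic similarity between descriptions;
    # entry (i, j) is high when description j is semantically close to i.
    scores = img_emb @ txt_emb.t()              # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)             # similarity of relevant pairs
    # Dynamic margin: semantically similar negatives get a smaller margin,
    # so they are not pushed as far away as clearly unrelated ones.
    margin = (base_margin - alpha * desc_sim).clamp(min=0.0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    # VSE++-style hard negatives: keep only the hardest negative per anchor.
    return cost_txt.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()

With desc_sim set to all zeros this reduces to the familiar fixed-margin hard-negatives loss, which is the baseline behaviour the paper argues against.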
Related papers
- Introspective Deep Metric Learning [91.47907685364036]
We propose an introspective deep metric learning framework for uncertainty-aware comparisons of images.
The proposed IDML framework improves the performance of deep metric learning through uncertainty modeling.
arXiv Detail & Related papers (2023-09-11T16:21:13Z)
- Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness.
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
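As a rough illustration of the consensus component, each compositor's scores over candidate images can be softened into a distribution, with pairwise KL divergences penalising disagreement; this is a hedged sketch of the idea, and the exact pairing and weighting used by Css-Net may differ.

import torch
import torch.nn.functional as F

def consensus_kl_loss(compositor_logits):
    # compositor_logits: list of (B, N) score tensors, one per compositor
    # (four in Css-Net), scoring N candidate images for each of B queries.
    log_probs = [F.log_softmax(s, dim=1) for s in compositor_logits]
    loss, n = 0.0, len(log_probs)
    for i in range(n):
        for j in range(n):
            if i != j:
                # KL(p_j || p_i): compositor i is nudged towards compositor j.
                loss = loss + F.kl_div(log_probs[i], log_probs[j],
                                       reduction='batchmean', log_target=True)
    return loss / (n * (n - 1))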
- Deep Semantic Statistics Matching (D2SM) Denoising Network [70.01091467628068]
We introduce the Deep Semantic Statistics Matching (D2SM) Denoising Network.
It exploits the semantic features of pretrained classification networks, then implicitly matches the probabilistic distribution of clear images in the semantic feature space.
By learning to preserve the semantic distribution of denoised images, we empirically find our method significantly improves the denoising capabilities of networks.
arXiv Detail & Related papers (2022-07-19T14:35:42Z)
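The summary leaves the matching mechanism abstract; one plausible reading, stated here as an assumption rather than the paper's exact estimator, is to compare feature statistics of denoised and clear images in the space of a frozen pretrained classifier.

import torch
import torchvision.models as models

class SemanticStatsLoss(torch.nn.Module):
    # Match first- and second-moment feature statistics of denoised vs.
    # clear images in a frozen classifier's semantic feature space
    # (a sketch; the backbone choice and moment matching are assumptions).
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, denoised, clear):
        f_d = self.features(denoised).flatten(2)   # (B, C, H*W)
        f_c = self.features(clear).flatten(2)
        mu_loss = (f_d.mean(-1) - f_c.mean(-1)).pow(2).mean()
        var_loss = (f_d.var(-1) - f_c.var(-1)).pow(2).mean()
        return mu_loss + var_loss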
- S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images [0.0]
Existing methods applied to cross-modality image pairs often fail to make the feature representations of correspondences as close as possible.
In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline.
We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
arXiv Detail & Related papers (2022-03-28T08:47:49Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy for learning joint visual-semantic embeddings.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
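A standard normalized cross-entropy (InfoNCE-style) formulation for joint visual-semantic embeddings looks like the following; this is one common instantiation rather than necessarily the paper's exact pair of losses, and the temperature value is illustrative.

import torch
import torch.nn.functional as F

def nce_vse_loss(img_emb, txt_emb, temperature=0.07):
    # Matched image-description pairs sit on the diagonal of the score
    # matrix; every other pair in the batch serves as a negative.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: image-to-text and text-to-image retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))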
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular Images [94.36401543589523]
We introduce the concept of semantic objectness to exploit the geometric relationship of these two tasks.
We then propose a Semantic Object and Depth Estimation Network (SOSD-Net) based on the objectness assumption.
To the best of our knowledge, SOSD-Net is the first network that exploits the geometry constraint for simultaneous monocular depth estimation and semantic segmentation.
arXiv Detail & Related papers (2021-01-19T02:41:03Z)
- Tasks Integrated Networks: Joint Detection and Retrieval for Image Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z)