Dissecting Deep Metric Learning Losses for Image-Text Retrieval
- URL: http://arxiv.org/abs/2210.13188v1
- Date: Fri, 21 Oct 2022 06:48:27 GMT
- Title: Dissecting Deep Metric Learning Losses for Image-Text Retrieval
- Authors: Hong Xuan, Xi Chen
- Abstract summary: Visual-Semantic Embedding (VSE) is a prevalent approach to image-text retrieval that learns a joint embedding space between the image and language modalities.
The triplet loss with hard-negative mining has become the de-facto objective for most VSE methods.
We present a novel Gradient-based Objective AnaLysis framework, or GOAL, to systematically analyze the combinations and reweighting of the gradients in existing DML functions.
- Score: 8.248111272824326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-Semantic Embedding (VSE) is a prevalent approach to image-text
retrieval that learns a joint embedding space between the image and language
modalities in which semantic similarities are preserved. The triplet loss
with hard-negative mining has become the de-facto objective for most VSE
methods. Inspired by recent progress in deep metric learning (DML) in the image
domain which gives rise to new loss functions that outperform triplet loss, in
this paper, we revisit the problem of finding better objectives for VSE in
image-text matching. Despite some attempts at designing losses based on
gradient movement, most DML losses are defined empirically in the embedding
space. Instead of directly applying these loss functions, which may lead to
sub-optimal gradient updates in model parameters, in this paper we present a
novel Gradient-based Objective AnaLysis framework, or GOAL, to
systematically analyze the combinations and reweighting of the gradients in
existing DML functions. With the help of this analysis framework, we further
propose a new family of objectives in the gradient space exploring different
gradient combinations. In the event that the gradients are not integrable to a
valid loss function, we implement our proposed objectives such that they would
directly operate in the gradient space instead of on the losses in the
embedding space. Comprehensive experiments have demonstrated that our novel
objectives have consistently improved performance over baselines across
different visual/text features and model frameworks. We also show the
generalizability of the GOAL framework by extending it to other models that use
triplet-family losses, including a vision-language model with heavy cross-modal
interactions, and achieve state-of-the-art results on the image-text retrieval
tasks on COCO and Flickr30K.
Related papers
- Improving Neural Surface Reconstruction with Feature Priors from Multi-View Image [87.00660347447494]
Recent advancements in Neural Surface Reconstruction (NSR) have significantly improved multi-view reconstruction when coupled with volume rendering.
We propose an investigation into feature-level consistent loss, aiming to harness valuable feature priors from diverse pretext visual tasks.
Our results, analyzed on DTU and EPFL, reveal that feature priors from image matching and multi-view stereo datasets outperform other pretext tasks.
arXiv Detail & Related papers (2024-08-04T16:09:46Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Bridging the Gap: Multi-Level Cross-Modality Joint Alignment for
Visible-Infrared Person Re-Identification [41.600294816284865]
Visible-Infrared person Re-IDentification (VI-ReID) aims to match pedestrians' images across visible and infrared cameras.
To solve the modality gap, existing mainstream methods adopt a learning paradigm converting the image retrieval task into an image classification task.
We propose a simple and effective method, the Multi-level Cross-modality Joint Alignment (MCJA), bridging both modality and objective-level gap.
arXiv Detail & Related papers (2023-07-17T08:24:05Z) - Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimize for the L2 metric without the need of generating pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z) - LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information
Retrieval [0.4264192013842096]
Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for information retrieval.
Most existing VSE networks are trained by adopting a hard negatives loss function which learns an objective margin between the similarity of relevant and irrelevant image-description embedding pairs.
This paper presents a novel approach that comprises two main parts: (1) finds the underlying semantics of image descriptions; and (2) proposes a novel semantically enhanced hard negatives loss function.
arXiv Detail & Related papers (2022-10-10T15:09:39Z) - Dissecting the impact of different loss functions with gradient surgery [7.001832294837659]
Pair-wise loss is an approach to metric learning that learns a semantic embedding by optimizing a loss function.
Here we decompose the gradient of these loss functions into components that relate to how they push the relative feature positions of the anchor-positive and anchor-negative pairs.
arXiv Detail & Related papers (2022-01-27T03:55:48Z) - Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training.
We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples.
For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z) - InverseForm: A Loss Function for Structured Boundary-Aware Segmentation [80.39674800972182]
We present a novel boundary-aware loss term for semantic segmentation using an inverse-transformation network.
This plug-in loss term complements the cross-entropy loss in capturing boundary transformations.
We analyze the quantitative and qualitative effects of our loss function on three indoor and outdoor segmentation benchmarks.
arXiv Detail & Related papers (2021-04-06T18:52:45Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.