Hierarchical Similarity Learning for Language-based Product Image
Retrieval
- URL: http://arxiv.org/abs/2102.09375v1
- Date: Thu, 18 Feb 2021 14:23:16 GMT
- Title: Hierarchical Similarity Learning for Language-based Product Image
Retrieval
- Authors: Zhe Ma, Fenghao Liu, Jianfeng Dong, Xiaoye Qu, Yuan He, Shouling Ji
- Abstract summary: This paper focuses on the cross-modal similarity measurement, and proposes a novel Hierarchical Similarity Learning network.
Experiments on a large-scale product retrieval dataset demonstrate the effectiveness of our proposed method.
- Score: 40.83290730640458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims for the language-based product image retrieval task. The
majority of previous works have made significant progress by designing network
structure, similarity measurement, and loss function. However, they typically
perform vision-text matching at a certain granularity, regardless of the intrinsic
multiple granularities of images. In this paper, we focus on the cross-modal
similarity measurement, and propose a novel Hierarchical Similarity Learning
(HSL) network. HSL first learns multi-level representations of the input data
with stacked encoders; object-granularity and image-granularity similarities
are then computed at each level. All the similarities are combined into the
final hierarchical cross-modal similarity. Experiments on a large-scale product
retrieval dataset demonstrate the effectiveness of our proposed method. Code
and data are available at https://github.com/liufh1/hsl.
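The fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify how the per-level similarities are weighted, so the uniform weighting below is an assumption (the released code at the link above would be authoritative).

```python
# Sketch of HSL's final similarity fusion: each encoder level is assumed to
# yield an object-granularity and an image-granularity similarity score, and
# all scores are combined into one hierarchical cross-modal similarity.
# The uniform per-level weights are an assumption for illustration; the
# actual method may learn or tune these weights.

def hierarchical_similarity(object_sims, image_sims, weights=None):
    """Fuse per-level object- and image-granularity similarities.

    object_sims, image_sims: one similarity score per encoder level.
    weights: optional per-level weights; defaults to uniform.
    """
    assert len(object_sims) == len(image_sims), "one score pair per level"
    n_levels = len(object_sims)
    if weights is None:
        weights = [1.0 / n_levels] * n_levels
    # Weighted sum over levels of the two granularities' scores.
    return sum(w * (o + i) for w, o, i in zip(weights, object_sims, image_sims))
```

At retrieval time, such a fused score would rank candidate product images against a text query; higher values indicate a better cross-modal match.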
Related papers
- Pairwise Similarity Learning is SimPLE [104.14303849615496]
We focus on a general yet important learning problem, pairwise similarity learning (PSL)
PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification.
We propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin.
arXiv Detail & Related papers (2023-10-13T23:56:47Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part.
Considering the texts associated with the images can help improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
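The hinge-based triplet ranking loss mentioned in the entry above has a standard form, sketched here for a single triplet. The function name and the margin value are illustrative, not taken from the paper: the loss is zero when the positive pair's similarity exceeds the negative pair's by at least the margin, and grows linearly otherwise.

```python
# Sketch of a hinge-based triplet ranking loss for image-text matching:
# the similarity of a matching (positive) image-text pair should exceed
# that of a mismatched (negative) pair by at least `margin`.
# The margin value 0.2 is a common default, assumed here for illustration.

def triplet_ranking_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge loss for one triplet: max(0, margin - sim_pos + sim_neg)."""
    return max(0.0, margin - sim_pos + sim_neg)
```

In training, such a loss is typically averaged over many triplets (or over the hardest negatives in a batch), pushing matched pairs closer together than mismatched ones in the joint embedding space.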
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.