Improving Cross-Modal Retrieval with Set of Diverse Embeddings
- URL: http://arxiv.org/abs/2211.16761v3
- Date: Mon, 24 Jul 2023 13:53:26 GMT
- Title: Improving Cross-Modal Retrieval with Set of Diverse Embeddings
- Authors: Dongwon Kim, Namyup Kim, Suha Kwak
- Abstract summary: Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity.
Set-based embedding has been studied as a solution to this problem.
We present a novel set-based embedding method, which is distinct from previous work in two aspects.
- Score: 19.365974066256026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal retrieval across image and text modalities is a challenging task
due to its inherent ambiguity: An image often exhibits various situations, and
a caption can be coupled with diverse images. Set-based embedding has been
studied as a solution to this problem. It seeks to encode a sample into a set
of different embedding vectors that capture different semantics of the sample.
In this paper, we present a novel set-based embedding method, which is distinct
from previous work in two aspects. First, we present a new similarity function
called smooth-Chamfer similarity, which is designed to alleviate the side
effects of existing similarity functions for set-based embedding. Second, we
propose a novel set prediction module to produce a set of embedding vectors
that effectively captures the diverse semantics of the input via the slot
attention mechanism. Our method is evaluated on the COCO and Flickr30K datasets across
different visual backbones, where it outperforms existing methods including
ones that demand substantially larger computation at inference.
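The smooth-Chamfer similarity named in the abstract can be sketched as follows. This is an illustrative reading of the abstract only, not the paper's exact formulation: the hard max over set elements in standard Chamfer matching is replaced by a log-sum-exp soft maximum, with `alpha` as an assumed name for the smoothing scale.

```python
import numpy as np

def smooth_chamfer(S1, S2, alpha=16.0):
    """Sketch of a smoothed Chamfer similarity between two embedding sets.

    S1: (n, d) and S2: (m, d) arrays of L2-normalized embeddings.
    Standard Chamfer similarity averages, for each element of one set,
    its best (max) cosine similarity in the other set; here that max is
    softened into a log-sum-exp with scale `alpha` (assumed parameter).
    """
    C = S1 @ S2.T  # (n, m) pairwise cosine similarities
    # Soft maximum over columns for each row of S1, and vice versa.
    s12 = np.log(np.exp(alpha * C).sum(axis=1)).mean() / alpha
    s21 = np.log(np.exp(alpha * C).sum(axis=0)).mean() / alpha
    return 0.5 * (s12 + s21)
```

As `alpha` grows this approaches the hard Chamfer similarity; smaller values spread credit across multiple set elements, which is the kind of side-effect mitigation the abstract alludes to.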
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks producing error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Cross-Modal Coordination Across a Diverse Set of Input Modalities [0.0]
Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one.
This paper proposes two approaches to the problem: the first is based on an extension of the CLIP contrastive objective to an arbitrary number of input modalities.
The second departs from the contrastive formulation and tackles the coordination problem by regressing the cross-modal similarities towards a target.
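The first approach described above, a CLIP-style contrastive objective extended to an arbitrary number of modalities, could look roughly like the sketch below: a symmetric InfoNCE loss averaged over every pair of modalities. The loss form, pairwise averaging, and `temperature` value are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def multimodal_contrastive_loss(embeddings, temperature=0.07):
    """Sketch: average symmetric InfoNCE over all modality pairs.

    embeddings: list of (batch, dim) arrays, one per modality; row i of
    every array corresponds to the same underlying sample.
    """
    def info_nce(za, zb):
        logits = za @ zb.T / temperature                 # (B, B) similarity logits
        logits = logits - logits.max(axis=1, keepdims=True)  # numeric stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))                   # matched pairs on the diagonal

    losses = []
    for a in range(len(embeddings)):
        for b in range(a + 1, len(embeddings)):
            za = embeddings[a] / np.linalg.norm(embeddings[a], axis=1, keepdims=True)
            zb = embeddings[b] / np.linalg.norm(embeddings[b], axis=1, keepdims=True)
            losses.append(0.5 * (info_nce(za, zb) + info_nce(zb, za)))
    return float(np.mean(losses))
```

With two modalities this reduces to the familiar CLIP objective; the pairwise loop is one plausible way to coordinate N modalities in a shared space.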
arXiv Detail & Related papers (2024-01-29T17:53:25Z)
- Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking [0.5242869847419834]
We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets.
arXiv Detail & Related papers (2023-09-15T04:39:11Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
- Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
- A Broader Picture of Random-walk Based Graph Embedding [2.6546685109604304]
Graph embedding based on random-walks supports effective solutions for many graph-related downstream tasks.
We develop an analytical framework for random-walk based graph embedding that consists of three components: a random-walk process, a similarity function, and an embedding algorithm.
We show that embeddings based on autocovariance similarity, when paired with dot product ranking for link prediction, outperform state-of-the-art methods based on Pointwise Mutual Information similarity by up to 100%.
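The autocovariance similarity mentioned above can be sketched from its standard definition for a random walk on a graph: with transition matrix P and stationary distribution pi, the autocovariance at lag tau is diag(pi) P^tau - pi pi^T. The lag value and the undirected-graph setting are assumptions here; embeddings would then come from factorizing this matrix, with dot products ranking candidate links.

```python
import numpy as np

def autocovariance_similarity(A, tau=3):
    """Sketch: autocovariance similarity matrix of a random walk on an
    undirected graph with adjacency matrix A (dense, no isolated nodes).

    R[i, j] measures how the walk's position covaries at lag `tau`
    (assumed value); entry-wise it is pi_i * P^tau[i, j] - pi_i * pi_j.
    """
    deg = A.sum(axis=1)
    P = A / deg[:, None]                  # row-stochastic transition matrix
    pi = deg / deg.sum()                  # stationary distribution (undirected case)
    R = np.diag(pi) @ np.linalg.matrix_power(P, tau) - np.outer(pi, pi)
    return R
```

For a reversible walk R is symmetric, so a truncated eigendecomposition of R yields node embeddings whose dot products can be compared directly, matching the "dot product ranking for link prediction" evaluation named in the summary.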
arXiv Detail & Related papers (2021-10-24T03:40:16Z)
- Batch Curation for Unsupervised Contrastive Representation Learning [21.83249229426828]
We introduce a batch curation scheme that selects training batches more in line with the underlying contrastive objective.
We provide insights into what constitutes beneficial similar and dissimilar pairs, and validate batch curation on CIFAR10.
arXiv Detail & Related papers (2021-08-19T12:14:50Z)
- Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.