Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
- URL: http://arxiv.org/abs/2309.08154v2
- Date: Thu, 21 Dec 2023 03:53:38 GMT
- Title: Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
- Authors: Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Dehua Peng, Huayi Wu
- Abstract summary: We propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy.
To encourage the generated candidate embeddings to capture various semantic variations, we construct a mixed distribution.
We compare the performance with existing set-based methods using four image feature encoders and two text feature encoders on three benchmark datasets.
- Score: 0.5242869847419834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The core of cross-modal matching is to accurately measure the similarity
between different modalities in a unified representation space. However,
compared to textual descriptions of a certain perspective, the visual modality
has more semantic variations. So, images are usually associated with multiple
textual captions in databases. Although popular symmetric embedding methods
have explored numerous modal interaction approaches, they often learn toward
increasing the average expression probability of multiple semantic variations
within image embeddings. Consequently, information entropy in embeddings is
increased, resulting in redundancy and decreased accuracy. In this work, we
propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the
information entropy. Specifically, we obtain a set of heterogeneous visual
sub-embeddings through dynamic orthogonal constraint loss. To encourage the
generated candidate embeddings to capture various semantic variations, we
construct a mixed distribution and employ a variance-aware weighting loss to
assign different weights to the optimization process. In addition, we develop a
Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and
enhance the performance. We compare the performance with existing set-based
methods using four image feature encoders and two text feature encoders on three
benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role
of different components by ablation studies and perform a sensitivity analysis
of the hyperparameters. The qualitative analysis of visualized bidirectional
retrieval and attention maps further demonstrates the ability of our method to
encode semantic variations.
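The abstract describes learning a set of heterogeneous visual sub-embeddings under an orthogonal constraint and matching captions against that set. The following is a minimal sketch of that general idea only, not the authors' formulation: the function names (`orthogonality_penalty`, `set_similarity`), the mean-squared-cosine penalty, and the max-over-set matching rule are all assumptions chosen for illustration. It does not cover the "dynamic" part of the constraint, the variance-aware weighting loss, or the Fast Re-ranking strategy, none of which are specified in the abstract.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(sub_embeddings):
    """Hypothetical penalty: mean squared cosine similarity over all
    distinct pairs of sub-embeddings. It is 0 when the sub-embeddings are
    mutually orthogonal and 1 when they all point in the same direction."""
    unit = [l2_normalize(v) for v in sub_embeddings]
    k = len(unit)
    if k < 2:
        return 0.0
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += dot(unit[i], unit[j]) ** 2
    return total / (k * (k - 1) / 2)


def set_similarity(sub_embeddings, text_embedding):
    """Hypothetical set-based matching: the best cosine similarity between
    a caption embedding and any one of the image's sub-embeddings."""
    t = l2_normalize(text_embedding)
    return max(dot(l2_normalize(v), t) for v in sub_embeddings)
```

Under these assumptions, a caption only needs to agree with one sub-embedding to score highly, while the penalty discourages the sub-embeddings from collapsing onto a single averaged representation.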
Related papers
- DEMO: A Statistical Perspective for Efficient Image-Text Matching [32.256725860652914]
We introduce Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching.
DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution.
In addition, we introduce collaborative consistency learning which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distribution from different directions.
arXiv Detail & Related papers (2024-05-19T09:38:56Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take source image, user guidance and previously predicted mask as the input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Deep Diversity-Enhanced Feature Representation of Hyperspectral Images [87.47202258194719]
We rectify 3D convolution by modifying its topology to enhance the rank upper-bound.
We also propose a novel diversity-aware regularization (DA-Reg) term that acts on the feature maps to maximize independence among elements.
To demonstrate the superiority of the proposed Re$^3$-ConvSet and DA-Reg, we apply them to various HS image processing and analysis tasks.
arXiv Detail & Related papers (2023-01-15T16:19:18Z)
- Improving Cross-Modal Retrieval with Set of Diverse Embeddings [19.365974066256026]
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity.
Set-based embedding has been studied as a solution to this problem.
We present a novel set-based embedding method, which is distinct from previous work in two aspects.
arXiv Detail & Related papers (2022-11-30T05:59:23Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- MPI: Multi-receptive and Parallel Integration for Salient Object Detection [17.32228882721628]
The semantic representation of deep features is essential for image context understanding.
In this paper, a novel method called MPI is proposed for salient object detection.
The proposed method outperforms state-of-the-art methods under different evaluation metrics.
arXiv Detail & Related papers (2021-08-08T12:01:44Z)
- Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization [21.904563910555368]
We propose a novel learning framework to construct a set of discriminative data domains within each image-text pair.
Our approach can generally improve the learning efficiency and the performance of existing metric learning frameworks.
arXiv Detail & Related papers (2020-10-23T01:48:37Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)