Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning
- URL: http://arxiv.org/abs/2108.00705v1
- Date: Mon, 2 Aug 2021 08:16:58 GMT
- Authors: Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong
- Abstract summary: This paper introduces a two-phase deep feature calibration framework for efficient learning of semantics enhanced text-image cross-modal joint embedding.
In preprocessing, we perform deep feature calibration by combining deep feature engineering with semantic context features derived from raw text-image input data.
In joint embedding learning, we perform deep feature calibration by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a two-phase deep feature calibration framework for
efficient learning of semantics enhanced text-image cross-modal joint
embedding, which clearly separates the deep feature calibration in data
preprocessing from training the joint embedding model. We use the Recipe1M
dataset for the technical description and empirical validation. In
preprocessing, we perform deep feature calibration by combining deep feature
engineering with semantic context features derived from raw text-image input
data. We leverage an LSTM to identify key terms and NLP methods to produce ranking
scores for the key terms before generating the key term feature. We leverage
wideResNet50 to extract and encode the image category semantics to help
semantic alignment of the learned recipe and image embeddings in the joint
latent space. In joint embedding learning, we perform deep feature calibration
by optimizing the batch-hard triplet loss function with soft-margin and double
negative sampling, also utilizing the category-based alignment loss and
discriminator-based alignment loss. Extensive experiments demonstrate that our
SEJE approach with the deep feature calibration significantly outperforms the
state-of-the-art approaches.
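As a rough illustration of the loss named in the abstract, the sketch below computes a batch-hard triplet loss with a soft margin over a batch of matched recipe and image embeddings. It is not the authors' implementation. Two points are assumptions: the soft margin is taken to be the common softplus form log(1 + exp(d_pos - d_neg)), and "double negative sampling" is read as picking, for each anchor, one hardest negative from the other modality and one from the anchor's own modality. The function names (cosine_distance, softplus, batch_hard_soft_margin_loss) are hypothetical, and the category-based and discriminator-based alignment losses are not covered.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def softplus(x):
    """Numerically stable log(1 + exp(x)), used here as the soft margin."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_soft_margin_loss(recipes, images):
    """Mean soft-margin triplet loss over a batch of matched pairs.

    recipes[i] and images[i] are assumed to be a matching pair.
    ASSUMPTION: each anchor gets two negatives, the closest non-matching
    item in the other modality and the closest other item in its own
    modality (one possible reading of "double negative sampling").
    """
    n = len(recipes)
    if n < 2 or n != len(images):
        raise ValueError("need two equally sized batches with at least 2 pairs")

    total = 0.0
    # Use each modality as the anchor side in turn.
    for anchors, others in ((recipes, images), (images, recipes)):
        for i in range(n):
            d_pos = cosine_distance(anchors[i], others[i])
            # Batch-hard: the negative with the smallest distance to the anchor.
            d_neg_cross = min(
                cosine_distance(anchors[i], others[j]) for j in range(n) if j != i
            )
            d_neg_same = min(
                cosine_distance(anchors[i], anchors[j]) for j in range(n) if j != i
            )
            total += softplus(d_pos - d_neg_cross) + softplus(d_pos - d_neg_same)

    # Two directions, two negatives per anchor, n anchors per direction.
    return total / (4 * n)


# Toy batch: correctly paired embeddings should score lower than swapped ones.
toy_recipes = [[1.0, 0.0], [0.0, 1.0]]
print(batch_hard_soft_margin_loss(toy_recipes, [[1.0, 0.0], [0.0, 1.0]]))
print(batch_hard_soft_margin_loss(toy_recipes, [[0.0, 1.0], [1.0, 0.0]]))
```

The softplus form avoids a fixed margin hyperparameter and keeps a small gradient even for triplets that are already well separated, which is one common motivation for soft-margin variants of the triplet loss.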
Related papers
- Fully Differentiable Correlation-driven 2D/3D Registration for X-ray to CT Image Fusion [3.868072865207522]
Image-based rigid 2D/3D registration is a critical technique for fluoroscopic guided surgical interventions.
We propose a novel fully differentiable correlation-driven network using a dual-branch CNN-transformer encoder.
A correlation-driven loss is proposed for low-frequency feature and high-frequency feature decomposition based on embedded information.
(2024-02-04T14:12:51Z)
- Segment Any Events via Weighted Adaptation of Pivotal Tokens [85.39087004253163]
This paper focuses on the nuanced challenge of tailoring the Segment Anything Models (SAMs) for integration with event data.
We introduce a multi-scale feature distillation methodology to optimize the alignment of token embeddings originating from event data with their RGB image counterparts.
(2023-12-24T12:47:08Z)
- Fundamental Limits of Two-layer Autoencoders, and Achieving Them with Gradient Methods [91.54785981649228]
This paper focuses on non-linear two-layer autoencoders trained in the challenging proportional regime.
Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods.
For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders.
(2022-12-27T12:37:34Z)
- IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation [3.6748639131154315]
We extend the concept of metric learning to the segmentation task.
We propose a simple convolutional projection head for obtaining dense pixel-level features.
A bidirectional regularization mechanism involving two-stream regularization training is devised for the downstream task.
(2022-10-26T23:11:02Z)
- Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks [68.36996813591425]
We propose Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks.
We enable the application of FixMatch in semi-supervised learning problems beyond image classification by adding a matching operation on the pseudo-labels.
Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching its performance with 1/4 of the labeled samples.
(2022-10-18T15:02:51Z)
- Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL consists of a two-stream architecture by extracting two types of global features from RGB and noise views respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
(2022-10-16T13:30:13Z)
- Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering [13.321319187357844]
This paper introduces a two-phase deep feature engineering framework for efficient learning of semantics enhanced joint embedding.
In preprocessing, we perform deep feature engineering by combining deep feature engineering with semantic context features derived from raw text-image input data.
In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling.
(2021-10-22T05:18:28Z)
- Transductive Few-Shot Classification on the Oblique Manifold [5.115651633703363]
Few-shot learning attempts to learn with limited data.
In this work, we perform the feature extraction in the Euclidean space.
We also propose a non-parametric Region Self-attention with Spatial Pyramid Pooling.
(2021-08-09T13:01:03Z)
- An Adaptive Framework for Learning Unsupervised Depth Completion [59.17364202590475]
We present a method to infer a dense depth map from a color image and associated sparse depth measurements.
We show that regularization and co-visibility are related via the fitness of the model to data and can be unified into a single framework.
(2021-06-06T02:27:55Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
(2020-11-18T08:42:32Z)
- Adaptive Fractional Dilated Convolution Network for Image Aesthetics Assessment [33.945579916184364]
An adaptive fractional dilated convolution (AFDC) is developed to tackle this issue at the convolutional kernel level.
We provide a concise formulation for mini-batch training and utilize a grouping strategy to reduce computational overhead.
Our experimental results demonstrate that our proposed method achieves state-of-the-art performance on image aesthetics assessment over the AVA dataset.
(2020-04-06T21:56:29Z)