Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering
- URL: http://arxiv.org/abs/2110.11592v1
- Date: Fri, 22 Oct 2021 05:18:28 GMT
- Title: Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering
- Authors: Zhongwei Xie, Ling Liu, Yanzhao Wu, Luo Zhong, Lin Li
- Abstract summary: This paper introduces a two-phase deep feature engineering framework for efficient learning of a semantics-enhanced joint embedding.
In preprocessing, we combine deep feature engineering with semantic context features derived from raw text-image input data.
In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with a soft margin and double negative sampling.
- Score: 13.321319187357844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a two-phase deep feature engineering framework for
efficient learning of a semantics-enhanced joint embedding, which clearly
separates the deep feature engineering in data preprocessing from the training
of the text-image joint embedding model. We use the Recipe1M dataset for the
technical description and empirical validation. In preprocessing, we combine
deep feature engineering with semantic context features derived from raw
text-image input data. We leverage an LSTM to identify
key terms, deep NLP models from the BERT family, TextRank, or TF-IDF to produce
ranking scores for key terms before generating the vector representation for
each key term by using word2vec. We leverage wideResNet50 and word2vec to
extract and encode the image category semantics of food images to help semantic
alignment of the learned recipe and image embeddings in the joint latent space.
In joint embedding learning, we perform deep feature engineering by optimizing
the batch-hard triplet loss function with a soft margin and double negative
sampling, also taking into account the category-based alignment loss and the
discriminator-based alignment loss. Extensive experiments demonstrate that our
SEJE approach with deep feature engineering significantly outperforms the
state-of-the-art approaches.
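Code sketch (preprocessing phase): the following Python snippet is a minimal, hypothetical rendering of the key-term pipeline described in the abstract above, using TF-IDF as the ranking scorer and a plain dictionary as a stand-in for pretrained word2vec vectors. The function names and the toy corpus are illustrative, not taken from the SEJE codebase.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def rank_key_terms(docs, top_k=5):
    """Score terms per document with TF-IDF and keep the top_k per document."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)             # shape: (num_docs, vocab_size)
    vocab = np.array(vec.get_feature_names_out())
    ranked = []
    for row in tfidf.toarray():
        order = row.argsort()[::-1][:top_k]     # indices of highest TF-IDF scores
        ranked.append([(vocab[i], float(row[i])) for i in order if row[i] > 0])
    return ranked

def embed_terms(terms, word_vectors, dim=300):
    """Look up a word2vec-style vector per key term; zero vector if out-of-vocabulary."""
    return np.stack([word_vectors.get(t, np.zeros(dim)) for t in terms])

# Toy usage: two recipe-like texts; an empty dict stands in for pretrained
# word2vec vectors (e.g., gensim KeyedVectors in a real pipeline).
docs = ["grilled salmon with lemon butter sauce",
        "chocolate cake with dark chocolate frosting"]
word_vectors = {}
for doc_terms in rank_key_terms(docs):
    terms = [term for term, score in doc_terms]
    print(terms, embed_terms(terms, word_vectors).shape)
```
In a real pipeline, a BERT-family scorer or TextRank could replace TF-IDF, and word vectors loaded from a pretrained word2vec model would replace the dictionary.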
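Code sketch (joint embedding phase): below is a hedged PyTorch sketch of a batch-hard triplet loss with a soft margin and two-directional ("double") negative sampling, in the spirit of the loss described in the abstract above. The exact SEJE formulation, including its category-based and discriminator-based alignment terms, may differ; the function name and toy tensors are illustrative.
```python
import torch
import torch.nn.functional as F

def soft_margin_batch_hard_loss(txt, img):
    """txt, img: (B, D) embeddings of B matching text-image pairs."""
    txt = F.normalize(txt, dim=1)
    img = F.normalize(img, dim=1)
    sim = txt @ img.t()                          # (B, B) cosine similarity matrix
    pos = sim.diag()                             # matching pairs lie on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf"))   # exclude the positives
    hard_t2i = neg.max(dim=1).values             # hardest wrong image per text
    hard_i2t = neg.max(dim=0).values             # hardest wrong text per image
    # Soft margin: softplus(neg - pos) replaces the usual hinge with a fixed margin;
    # mining hard negatives in both directions is the "double" negative sampling.
    return (F.softplus(hard_t2i - pos) + F.softplus(hard_i2t - pos)).mean()

# Toy usage with random embeddings.
txt, img = torch.randn(8, 128), torch.randn(8, 128)
print(soft_margin_batch_hard_loss(txt, img).item())
```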
Related papers
- SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images [4.269350826756809]
This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera.
The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation tasks without compromising computational efficiency.
The framework is thoroughly evaluated on two datasets - the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset - and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks.
arXiv Detail & Related papers (2024-03-15T20:04:27Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantics.
arXiv Detail & Related papers (2022-06-21T17:40:55Z)
- Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical Documents [1.7491858164568674]
This work presents the first approach that adopts transformer networks for named entity recognition in handwritten documents.
We achieve new state-of-the-art performance in the ICDAR 2017 Information Extraction competition using the Esposalles database.
arXiv Detail & Related papers (2021-12-08T09:26:21Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation [94.16816278191477]
We present a framework for semi-supervised and domain-adaptive semantic segmentation.
It is enhanced by self-supervised monocular depth estimation trained only on unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset.
arXiv Detail & Related papers (2021-08-28T01:33:38Z)
- Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning [14.070841236184439]
This paper introduces a two-phase deep feature calibration framework for efficient learning of a semantics-enhanced text-image cross-modal joint embedding.
In preprocessing, we perform deep feature calibration by combining deep feature engineering with semantic context features derived from raw text-image input data.
In joint embedding learning, we perform deep feature calibration by optimizing the batch-hard triplet loss function with soft-margin and double negative sampling.
arXiv Detail & Related papers (2021-08-02T08:16:58Z)
- A Robust Deep Ensemble Classifier for Figurative Language Detection [1.3124513975412255]
Figurative Language (FL) recognition is an open problem of Sentiment Analysis in the broader field of Natural Language Processing (NLP).
The problem comprises three interrelated FL recognition tasks: sarcasm, irony, and metaphor, which, in the present paper, are addressed with advanced Deep Learning (DL) techniques.
The proposed Deep Soft Ensemble (DESC) model achieves very good performance, worthy of comparison with relevant methodologies and state-of-the-art technologies in the challenging field of FL recognition.
arXiv Detail & Related papers (2021-07-09T11:26:37Z)
- Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation [90.87105131054419]
We present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains.
arXiv Detail & Related papers (2020-12-19T21:18:03Z)
- Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)