Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval
- URL: http://arxiv.org/abs/2002.10016v1
- Date: Sun, 23 Feb 2020 23:58:04 GMT
- Title: Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval
- Authors: Hadi Abdi Khojasteh (1), Ebrahim Ansari (1 and 2), Parvin Razzaghi (1
and 3), Akbar Karimi (4) ((1) Institute for Advanced Studies in Basic
Sciences (IASBS), Zanjan, Iran, (2) Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics, Charles University, Czechia, (3)
Institute for Research in Fundamental Sciences (IPM), Tehran, Iran, (4) IMP
Lab, Department of Engineering and Architecture, University of Parma, Parma,
Italy)
- Abstract summary: We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the task of matching images and sentences by learning a
visual-textual embedding space for cross-modal retrieval. Finding such a space
is a challenging task since the features and representations of text and image
are not comparable. In this work, we introduce an end-to-end deep multimodal
convolutional-recurrent network for learning both vision and language
representations simultaneously to infer image-text similarity. The model learns
which pairs are a match (positive) and which ones are a mismatch (negative)
using a hinge-based triplet ranking loss. To learn the joint representations,
we leverage our newly extracted collection of tweets from Twitter. The main
characteristic of our dataset is that the images and tweets are not
standardized in the same way as those in the benchmarks. Furthermore, there can
be a higher semantic correlation between the pictures and tweets, in contrast
to the benchmarks, in which the descriptions are well-organized. Experimental
results on the MS-COCO
benchmark dataset show that our model outperforms certain methods presented
previously and has competitive performance compared to the state-of-the-art.
The code and dataset have been made available publicly.
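As a rough illustration of the hinge-based triplet ranking loss mentioned in the abstract, the sketch below shows one common formulation over in-batch negatives. It is a minimal sketch assuming PyTorch, batch-aligned positive pairs, and an arbitrary margin; the function and variable names are illustrative and are not taken from the authors' released code.

```python
# Minimal sketch of a hinge-based triplet ranking loss for joint
# image-text embeddings (assumed names and margin; not the paper's code).
import torch

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings; row i of each is a matching pair."""
    # Cosine similarity matrix; diagonal entries are the positive pairs.
    scores = img_emb @ txt_emb.t()                        # (batch, batch)
    pos = scores.diag().view(-1, 1)                       # similarity of matched pairs

    # Hinge on every mismatched pair in the batch, in both retrieval directions.
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> mismatched texts
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text  -> mismatched images

    # Do not penalize the positive pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.sum() + cost_img.sum()
```

In this formulation, each matched image-text pair on the diagonal of the similarity matrix is pushed above every mismatched pair in the batch by at least the margin, in both the image-to-text and text-to-image directions.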
Related papers
- Image-Text Matching with Multi-View Attention [1.92360022393132]
Existing two-stream models for image-text matching show good performance while ensuring retrieval speed.
We propose MVAM (Multi-View Attention Model), a multi-view attention approach for two-stream image-text matching.
Experimental results on MSCOCO and Flickr30K show that our proposed model brings improvements over existing models.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited labeled data scales.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval is the task of finding images related to a given query text, or vice versa.
Many recent methods have proposed effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
A hierarchical similarity reasoning module is proposed to automatically extract context information.
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching [10.992151305603267]
We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
arXiv Detail & Related papers (2021-10-06T09:54:28Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (see the contrastive-loss sketch after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Learning to Compare Relation: Semantic Alignment for Few-Shot Learning [48.463122399494175]
We present a novel semantic alignment model to compare relations, which is robust to content misalignment.
We conduct extensive experiments on several few-shot learning datasets.
arXiv Detail & Related papers (2020-02-29T08:37:02Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
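As a companion to the dual-encoder entry above ("Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision"), the sketch below shows one common form of a symmetric contrastive loss over in-batch image-text pairs. It is a minimal sketch under stated assumptions (PyTorch, L2-normalized embeddings, an arbitrary temperature); the names are illustrative and this is not the implementation described in that paper.

```python
# Illustrative symmetric contrastive loss for a dual-encoder image-text model
# (assumed names and temperature; not taken from any of the papers above).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.05):
    """img_emb, txt_emb: (batch, dim); row i of each is a matching image-text pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature       # (batch, batch) similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy over in-batch negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each matched pair serves as the positive class in a softmax over all other items in the batch, in both retrieval directions, which is what lets a large but noisy corpus compensate for imperfect pairs.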
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.