IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level
Grounding of Images
- URL: http://arxiv.org/abs/2305.10438v1
- Date: Fri, 12 May 2023 05:34:52 GMT
- Title: IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level
Grounding of Images
- Authors: Varuna Krishna, S Suryavardan, Shreyash Mishra, Sathyanarayanan
Ramamoorthy, Parth Patwa, Megha Chakraborty, Aman Chadha, Amitava Das, Amit
Sheth
- Abstract summary: We introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the object level on 21K distinct image objects from 1M image+text pairs.
IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation.
We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval.
- Score: 2.9174297412129957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings, i.e., semantically meaningful vector representations of
words, are largely influenced by the distributional hypothesis "You shall know
a word by the company it keeps" (Harris, 1954), whereas modern prediction-based
neural network embeddings rely on design choices and hyperparameter
optimization. Word embeddings like Word2Vec and GloVe capture contextuality and
real-world analogies well, but contemporary convolution-based image embeddings
such as VGGNet and AlexNet do not capture contextual knowledge.
The popular king-queen analogy does not hold true for most commonly used vision
embeddings.
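To make the analogy test concrete: in a good word embedding space the offset king - man + woman lands near queen, while typical CNN image embeddings have no comparable structure. A minimal sketch using gensim's pretrained GloVe vectors (the downloader model name is an assumption; any pretrained KeyedVectors behaves the same way):

```python
# Sketch of the king - man + woman ~= queen analogy test for word embeddings.
# Assumes gensim and its downloader model "glove-wiki-gigaword-100" are available.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

# most_similar solves the analogy by vector offset: king - man + woman -> ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically returns 'queen' as the nearest word
```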
In this paper, we introduce a pre-trained joint embedding (JE), named
IMAGINATOR, trained at the object level on 21K distinct image objects from 1M
image+text pairs. A JE is a way to encode multimodal data into a vector space
where the text modality serves as the grounding key to which the complementary
modality (in this case, the image) is anchored. IMAGINATOR encapsulates three
individual representations: (i) object-object co-location, (ii) word-object
co-location, and (iii) word-object correlation. These three views capture
complementary aspects of the two modalities and are further combined to
obtain the final JEs.
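The abstract does not spell out how these co-location/correlation signals are built or combined, so the following is only a toy illustration of the word-object co-location idea, assuming each image+text pair supplies caption tokens plus detected object labels; turning counts into dense vectors (e.g., via PPMI and SVD) and the final combination step are placeholders, not the paper's recipe.

```python
# Toy illustration of word-object co-location counting (an assumption-laden
# sketch, not IMAGINATOR's actual construction). Each training pair is assumed
# to provide caption tokens and detector object labels.
from collections import Counter
import numpy as np

pairs = [
    (["a", "dog", "chases", "a", "ball"], ["dog", "ball", "grass"]),
    (["a", "man", "rides", "a", "horse"], ["man", "horse"]),
]

cooc = Counter()
for tokens, objects in pairs:
    for w in set(tokens):
        for o in set(objects):
            cooc[(w, o)] += 1  # word w and object o co-located in one pair

# Assemble a word x object count matrix; PPMI weighting and SVD (or similar)
# would then yield dense vectors, and the three signals would be combined,
# e.g. by concatenation -- the abstract only says they are "combined".
words = sorted({w for toks, _ in pairs for w in toks})
objs = sorted({o for _, obs in pairs for o in obs})
M = np.zeros((len(words), len(objs)))
for (w, o), c in cooc.items():
    M[words.index(w), objs.index(o)] = c
print(M.shape)  # (num_words, num_objects)
```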
Generated JEs are intrinsically evaluated to assess how well they capture the
contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR
JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and
(iii) text-based image retrieval. IMAGINATOR establishes a new standard on the
aforementioned downstream tasks by outperforming the current SoTA on all the
selected tasks. IMAGINATOR will be made publicly available. The code is
available at https://github.com/varunakk/IMAGINATOR
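As a usage note for the retrieval task: once text and images share one joint space, text-based image retrieval reduces to nearest-neighbour search under cosine similarity. The sketch below uses random placeholder vectors in place of the released encoders (the dimensionality and any encoder interface are assumptions):

```python
# Minimal sketch of text-based image retrieval in a shared joint-embedding
# space. Random vectors stand in for real text/image JEs; only the ranking
# logic is the point here.
import numpy as np

rng = np.random.default_rng(0)
image_jes = rng.normal(size=(1000, 300))   # JEs of 1000 candidate images (300-d assumed)
query_je = rng.normal(size=300)            # JE of the text query

def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and every candidate row."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

scores = cosine_scores(query_je, image_jes)
top5 = np.argsort(-scores)[:5]             # indices of the best-matching images
print(top5, scores[top5])
```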
Related papers
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs).
KEDs implicitly models the attributes of the reference images by incorporating a database.
arXiv Detail & Related papers (2024-03-24T04:23:56Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence feature extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss (see the sketch after this list).
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
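Sketch referenced from the cross-media retrieval entry above: a hinge-based triplet ranking loss in its standard max-margin form, written from the common formulation rather than taken from any of the listed papers' code.

```python
# Standard hinge-based triplet ranking loss for image-text matching
# (a generic sketch; not any specific paper's implementation).
import torch

def triplet_ranking_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim[i, j] = similarity of image i and caption j; matched pairs sit on the diagonal."""
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    # Hinge over negatives in both directions: wrong captions for each image,
    # and wrong images for each caption.
    cost_cap = (margin + sim - pos).clamp(min=0)      # compare against s(i, i) per row
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # compare against s(j, j) per column
    mask = torch.eye(n, dtype=torch.bool)
    return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

sims = torch.randn(8, 8)  # toy similarity matrix for a batch of 8 matched pairs
print(triplet_ranking_loss(sims))
```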