Multimodal Representation Learning With Text and Images
- URL: http://arxiv.org/abs/2205.00142v1
- Date: Sat, 30 Apr 2022 03:25:01 GMT
- Title: Multimodal Representation Learning With Text and Images
- Authors: Aishwarya Jayagopal, Ankireddy Monica Aiswarya, Ankita Garg,
Srinivasan Kolumam Nandakumar
- Abstract summary: This project leverages multimodal AI and matrix factorization techniques for representation learning on text and image data simultaneously.
The learnt representations are evaluated using downstream classification and regression tasks.
- Score: 2.998895355715139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, multimodal AI has seen an upward trend as researchers
integrate data of different types, such as text, images, and speech, into their
models to get the best results. This project leverages multimodal AI and matrix
factorization techniques for representation learning on text and image data
simultaneously, drawing on widely used techniques from Natural Language
Processing (NLP) and Computer Vision. The learnt representations are evaluated
using downstream classification and regression tasks. The methodology adopted
can be extended beyond the scope of this project, as it uses Auto-Encoders for
unsupervised representation learning.
Related papers
- Multimodal Representation Learning using Adaptive Graph Construction [0.5221459608786241]
Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text.
We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities.
We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach.
arXiv Detail & Related papers (2024-10-08T21:57:46Z) - Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Using Multiple Instance Learning to Build Multimodal Representations [3.354271620160378]
Image-text multimodal representation learning aligns data across modalities and enables important medical applications.
We propose a generic framework for constructing permutation-invariant score functions that includes many existing multimodal representation learning approaches as special cases.
arXiv Detail & Related papers (2022-12-11T18:01:11Z) - Masked Vision and Language Modeling for Multi-modal Representation Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performance when using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters (a minimal sketch of the convolutional-autoencoder idea appears after this list).
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
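For the convolutional-autoencoder paper above, the following is a minimal sketch of the general idea (not the paper's implementation): word-level aligned multimodal features are stacked into a 2-D matrix and compressed into a fixed-size embedding with a small convolutional auto-encoder. The input shape, channel counts, and latent size here are assumptions.

```python
# Hypothetical sketch: each example is assumed to be a 64 x 64 matrix of
# word-level aligned, fused multimodal features.
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # (B, 16, 32, 32)
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # (B, 32, 16, 16)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),        # sequence-level embedding
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # (B, 16, 32, 32)
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # (B, 1, 64, 64)
        )

    def forward(self, x):
        z = self.encoder(x)          # fixed-size multimodal embedding
        return z, self.decoder(z)    # reconstruction for the unsupervised loss

model = ConvAutoEncoder()
x = torch.randn(8, 1, 64, 64)            # placeholder word-aligned multimodal matrices
z, recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective; z feeds downstream tasks
```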