Multimodal Representation Learning With Text and Images
- URL: http://arxiv.org/abs/2205.00142v1
- Date: Sat, 30 Apr 2022 03:25:01 GMT
- Title: Multimodal Representation Learning With Text and Images
- Authors: Aishwarya Jayagopal, Ankireddy Monica Aiswarya, Ankita Garg,
Srinivasan Kolumam Nandakumar
- Abstract summary: This project leverages multimodal AI and matrix factorization techniques for representation learning, on text and image data simultaneously.
The learnt representations are evaluated using downstream classification and regression tasks.
- Score: 2.998895355715139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, multimodal AI has seen an upward trend as researchers
integrate data of different types, such as text, images, and speech, into
modelling to obtain the best results. This project leverages multimodal AI and
matrix factorization techniques for representation learning on text and image
data simultaneously, thereby employing widely used techniques from Natural
Language Processing (NLP) and Computer Vision. The learnt representations are
evaluated using downstream classification and regression tasks. The methodology
adopted can be extended beyond the scope of this project, as it uses
Auto-Encoders for unsupervised representation learning.
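As a rough illustration of the approach the abstract describes, the sketch below fuses pre-extracted text and image features by concatenation and learns a shared low-dimensional code with a linear autoencoder, which is equivalent to a matrix-factorization-style model. All dimensions, data, and hyperparameters here are hypothetical toy values, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pre-extracted features (hypothetical dimensions):
# 100 samples, 48-dim text features and 32-dim image features.
text_feats = rng.normal(size=(100, 48))
image_feats = rng.normal(size=(100, 32))

# Fuse modalities by concatenation, then learn a shared low-dimensional
# code with a linear autoencoder: X is approximated by (X W_enc) W_dec
# through a k-dimensional bottleneck.
X = np.hstack([text_feats, image_feats])   # shape (100, 80)
X = X - X.mean(axis=0)                     # center the features
k = 8
W_enc = rng.normal(scale=0.1, size=(80, k))
W_dec = rng.normal(scale=0.1, size=(k, 80))

lr = 0.05
for _ in range(500):
    Z = X @ W_enc        # latent codes
    X_hat = Z @ W_dec    # reconstruction
    err = X_hat - X
    # Gradient descent on mean squared reconstruction error.
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

# Learned multimodal embeddings, shape (100, 8); in the paper's setting
# these would feed the downstream classification and regression tasks.
Z = X @ W_enc
```

The bottleneck code plays the same role as the factor matrices in matrix factorization; a nonlinear Auto-Encoder would replace the two linear maps with encoder and decoder networks.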
Related papers
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want (arXiv, 2024-03-29)
  We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data (arXiv, 2023-08-20)
  We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning (arXiv, 2023-06-01)
  This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning (arXiv, 2023-04-20)
  Training an efficacious deep learning model requires large data with diverse styles and qualities. A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability. The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
- Multi-Modal Representation Learning with Text-Driven Soft Masks (arXiv, 2023-04-03)
  We propose a visual-linguistic representation learning approach within a self-supervised learning framework. We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image. We identify the regions relevant to each word by computing the word-conditional visual attention using a multi-modal encoder.
- Using Multiple Instance Learning to Build Multimodal Representations (arXiv, 2022-12-11)
  Image-text multimodal representation learning aligns data across modalities and enables important medical applications. We propose a generic framework for constructing permutation-invariant score functions, with many existing multimodal representation learning approaches as special cases.
- MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning (arXiv, 2022-10-09)
  We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
- Masked Vision and Language Modeling for Multi-modal Representation Learning (arXiv, 2022-08-03)
  We study how to use masked signal modeling in vision and language (V+L) representation learning. We propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with help from the other modality. Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performance when trained on large amounts of data.
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders (arXiv, 2021-10-06)
  We propose extracting unsupervised multimodal language representations that are universal and can be applied to different tasks. We map word-level aligned multimodal sequences to 2-D matrices and then use convolutional autoencoders to learn embeddings by combining multiple datasets. Our method is extremely lightweight and generalizes to other tasks and unseen data with a small performance drop and almost the same number of parameters.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.