Unsupervised Multimodal Language Representations using Convolutional
Autoencoders
- URL: http://arxiv.org/abs/2110.03007v1
- Date: Wed, 6 Oct 2021 18:28:07 GMT
- Title: Unsupervised Multimodal Language Representations using Convolutional
Autoencoders
- Authors: Panagiotis Koromilas and Theodoros Giannakopoulos
- Abstract summary: We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with small performance drop and almost the same number of parameters.
- Score: 5.464072883537924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Language Analysis is a demanding area of research, since it is
associated with two requirements: combining different modalities and capturing
temporal information. During the last years, several works have been proposed
in the area, mostly centered around supervised learning in downstream tasks. In
this paper we propose extracting unsupervised Multimodal Language
representations that are universal and can be applied to different tasks.
Towards this end, we map the word-level aligned multimodal sequences to 2-D
matrices and then use Convolutional Autoencoders to learn embeddings by
combining multiple datasets. Extensive experimentation on Sentiment Analysis
(MOSEI) and Emotion Recognition (IEMOCAP) indicate that the learned
representations can achieve near-state-of-the-art performance with just the use
of a Logistic Regression algorithm for downstream classification. It is also
shown that our method is extremely lightweight and can be easily generalized to
other tasks and unseen data with small performance drop and almost the same
number of parameters. The proposed multimodal representation models are
open-sourced and will help grow the applicability of Multimodal Language.
Related papers
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature
Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Semi-supervised Multimodal Representation Learning through a Global Workspace [2.8948274245812335]
"Global Workspace" is a shared representation for two input modalities.
This architecture is amenable to self-supervised training via cycle-consistency.
We show that such an architecture can be trained to align and translate between two modalities with very little need for matched data.
arXiv Detail & Related papers (2023-06-27T12:41:36Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing such translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - METEOR: Learning Memory and Time Efficient Representations from
Multi-modal Data Streams [19.22829945777267]
We present METEOR, a novel MEmory and Time Efficient Online Representation learning technique.
We show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
arXiv Detail & Related papers (2020-07-23T08:18:02Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.