Does a Technique for Building Multimodal Representation Matter? --
Comparative Analysis
- URL: http://arxiv.org/abs/2206.06367v1
- Date: Thu, 9 Jun 2022 21:30:10 GMT
- Title: Does a Technique for Building Multimodal Representation Matter? --
Comparative Analysis
- Authors: Maciej Pawłowski, Anna Wróblewska, Sylwia Sysko-Romańczuk
- Abstract summary: We show that the choice of the technique for building a multimodal representation is crucial for obtaining the highest possible model performance.
Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creating a meaningful representation by fusing single modalities (e.g., text,
images, or audio) is the core concept of multimodal learning. Although several
techniques for building multimodal representations have proven successful, they
have not yet been systematically compared. It has therefore been unclear which
technique can be expected to yield the best results in a given scenario and
which factors should be considered when choosing one. This paper explores the
most common techniques for building multimodal data representations -- late
fusion, early fusion, and the sketch -- and compares them on classification
tasks. Experiments are conducted on three datasets: Amazon Reviews,
MovieLens25M, and MovieLens1M. In general, our results confirm that multimodal
representations can boost the performance of unimodal models, from 0.919 to
0.969 accuracy on Amazon Reviews and from 0.907 to 0.918 AUC on MovieLens25M.
However, experiments on both MovieLens datasets indicate the importance of
input data that is meaningful for the given task. In this article, we show that
the choice of the technique for building a multimodal representation, combined
with the proper selection of modalities, is crucial for obtaining the highest
possible model performance. This choice depends on: the influence each modality
has on the analyzed machine learning (ML) problem; the type of the ML task; and
the memory constraints during the training and prediction phases.
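As a rough illustration of the three techniques compared here, the snippet below contrasts early fusion, late fusion, and a sketch-style projection on a toy classification setup. It is a minimal sketch, not the authors' implementation: the module names, feature dimensions, combination rules, and the random-sign projection used for the "sketch" variant are all assumptions made for illustration.

```python
# Minimal, illustrative sketch of the three fusion strategies compared in the
# paper. All names, dimensions, and the sketch projection are assumptions.
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, N_CLASSES, SKETCH_DIM = 768, 512, 2, 256

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then classify the joint vector."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Linear(TEXT_DIM + IMAGE_DIM, N_CLASSES)

    def forward(self, text_feat, image_feat):
        return self.clf(torch.cat([text_feat, image_feat], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then combine the predictions."""
    def __init__(self):
        super().__init__()
        self.text_clf = nn.Linear(TEXT_DIM, N_CLASSES)
        self.image_clf = nn.Linear(IMAGE_DIM, N_CLASSES)

    def forward(self, text_feat, image_feat):
        # Averaging logits is one simple combination rule; weighted sums or a
        # small meta-classifier are common alternatives.
        return (self.text_clf(text_feat) + self.image_clf(image_feat)) / 2

class SketchFusion(nn.Module):
    """Project each modality into a shared low-dimensional vector with a fixed
    random +/-1 projection (in the spirit of sketching), sum, and classify."""
    def __init__(self):
        super().__init__()
        self.register_buffer("proj_text", torch.randn(TEXT_DIM, SKETCH_DIM).sign())
        self.register_buffer("proj_image", torch.randn(IMAGE_DIM, SKETCH_DIM).sign())
        self.clf = nn.Linear(SKETCH_DIM, N_CLASSES)

    def forward(self, text_feat, image_feat):
        sketch = text_feat @ self.proj_text + image_feat @ self.proj_image
        return self.clf(sketch)

if __name__ == "__main__":
    text = torch.randn(4, TEXT_DIM)    # e.g., review-text embeddings
    image = torch.randn(4, IMAGE_DIM)  # e.g., product-image embeddings
    for model in (EarlyFusion(), LateFusion(), SketchFusion()):
        print(type(model).__name__, model(text, image).shape)  # (4, N_CLASSES)
```

Early fusion exposes cross-modal interactions to the classifier at the cost of a larger joint input, late fusion keeps the unimodal models independent and only combines their predictions, and sketching compresses all modalities into a fixed-size vector, which is the kind of memory trade-off the abstract lists as a selection criterion.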
Related papers
- DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios [32.77825044757212]
We propose DFIMat, a decoupled framework that enables flexible interactive matting.
Specifically, we first decouple the task into two sub-tasks: localizing target instances by understanding scene semantics and flexible user inputs, and performing refinement for instance-level matting.
We observe a clear performance gain from decoupling, as it makes sub-tasks easier to learn, and the flexible multi-type input further enhances both effectiveness and efficiency.
arXiv Detail & Related papers (2024-10-13T10:02:58Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study on a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
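For the model-merging entry above, one way such a weight-distance indicator could look is sketched below. The cosine distance over flattened parameters is an assumed stand-in; the paper's two metrics may be defined differently.

```python
# Illustrative weight-distance computation between two models to be merged.
# Cosine distance over flattened parameters is an assumed stand-in, not
# necessarily one of the paper's proposed metrics.
import torch
import torch.nn as nn

def weight_cosine_distance(model_a: nn.Module, model_b: nn.Module) -> float:
    """1 - cosine similarity between the two models' concatenated parameters."""
    vec_a = torch.cat([p.detach().flatten() for p in model_a.parameters()])
    vec_b = torch.cat([p.detach().flatten() for p in model_b.parameters()])
    cos = torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0)
    return 1.0 - cos.item()

if __name__ == "__main__":
    # Two small models with identical architectures, e.g., fine-tuned on
    # different tasks; a smaller distance would suggest easier merging.
    a, b = nn.Linear(16, 4), nn.Linear(16, 4)
    print(weight_cosine_distance(a, b))
```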
- Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation [16.308470947384134]
HA-Fedformer is a novel transformer-based model that empowers unimodal training with only a unimodal dataset at the client.
We develop an uncertainty-aware aggregation method for the local encoders with layer-wise Markov Chain Monte Carlo sampling.
Our experiments on popular sentiment analysis benchmarks, CMU-MOSI and CMU-MOSEI, demonstrate that HA-Fedformer significantly outperforms state-of-the-art multimodal models.
arXiv Detail & Related papers (2023-03-27T07:07:33Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
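The M3AE entry above describes learning a single encoder for vision and language through masked token prediction. The following is a heavily simplified sketch of that idea; the tokenization, masking ratio, model sizes, and loss terms are assumptions for illustration and not the M3AE implementation.

```python
# Simplified sketch of masked token prediction over a joint image+text sequence.
# Shapes, masking ratio, and architecture are illustrative assumptions only.
import torch
import torch.nn as nn

VOCAB, D, PATCHES, TEXT_LEN, MASK_RATIO = 1000, 128, 16, 12, 0.5

class TinyJointMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(64, D)        # flattened 8x8 patches -> D
        self.token_embed = nn.Embedding(VOCAB, D)  # text token ids -> D
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D))
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(D, VOCAB)       # predict masked text ids
        self.pixel_head = nn.Linear(D, 64)         # reconstruct masked patches

    def forward(self, patches, token_ids):
        # Embed both modalities into one sequence and mask random positions.
        x = torch.cat([self.patch_embed(patches), self.token_embed(token_ids)], dim=1)
        mask = torch.rand(x.shape[:2], device=x.device) < MASK_RATIO
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.encoder(x)
        img_h, txt_h = h[:, :PATCHES], h[:, PATCHES:]
        img_mask, txt_mask = mask[:, :PATCHES], mask[:, PATCHES:]
        # Losses only on masked positions: pixel regression for patches,
        # cross-entropy over the vocabulary for text tokens.
        img_loss = ((self.pixel_head(img_h) - patches) ** 2)[img_mask].mean()
        txt_logits = self.text_head(txt_h)[txt_mask]
        txt_loss = nn.functional.cross_entropy(txt_logits, token_ids[txt_mask])
        return img_loss + txt_loss

model = TinyJointMAE()
loss = model(torch.randn(2, PATCHES, 64), torch.randint(0, VOCAB, (2, TEXT_LEN)))
loss.backward()
```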
- CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection [24.243349217940274]
We propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection.
Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image.
In addition to the sentiment analysis task, we also design two contrastive learning tasks: label-based contrastive learning and data-based contrastive learning.
arXiv Detail & Related papers (2022-04-12T04:03:06Z)
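The CLMLF entry above combines token-level fusion with contrastive objectives. Below is a rough sketch of that recipe: concatenated text and image tokens pass through a transformer fusion module, the pooled output is classified, and a label-based (supervised) contrastive term is added. All names, sizes, and the exact contrastive formulation are assumptions, not the authors' code.

```python
# Illustrative sketch: token-level fusion of text and image features plus a
# label-based (supervised) contrastive term. Details are assumed, not CLMLF's.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N_CLASSES = 128, 3

class FusionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.clf = nn.Linear(D, N_CLASSES)

    def forward(self, text_tokens, image_tokens):
        joint = torch.cat([text_tokens, image_tokens], dim=1)  # (B, T+P, D)
        fused = self.fusion(joint).mean(dim=1)                 # pooled (B, D)
        return self.clf(fused), fused

def label_contrastive_loss(features, labels, temperature=0.1):
    """Pull together pooled features that share a label, push apart the rest."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    same.fill_diagonal_(0)  # exclude self-pairs from the positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    denom = same.sum(1).clamp(min=1)
    return -((same * log_prob).sum(1) / denom).mean()

model = FusionClassifier()
text, image = torch.randn(8, 12, D), torch.randn(8, 16, D)
labels = torch.randint(0, N_CLASSES, (8,))
logits, pooled = model(text, image)
loss = F.cross_entropy(logits, labels) + label_contrastive_loss(pooled, labels)
loss.backward()
```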
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs.
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)
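Mutual information between paired unimodal representations, as in the MMIM entry above, is usually maximized through a tractable lower bound. The sketch below uses an InfoNCE-style bound purely as an illustration; MMIM's actual estimators and hierarchical setup may differ.

```python
# Illustrative InfoNCE-style lower bound on mutual information between two
# unimodal representations of the same samples. MMIM's exact estimators and
# hierarchy may differ; this only sketches the general idea.
import torch
import torch.nn.functional as F

def infonce_mi_bound(text_repr, audio_repr, temperature=0.1):
    """Matching (text, audio) rows are positives, all other rows in the batch
    are negatives; minimizing this loss maximizes an MI lower bound."""
    t = F.normalize(text_repr, dim=-1)
    a = F.normalize(audio_repr, dim=-1)
    logits = t @ a.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(t.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

text_repr = torch.randn(16, 64, requires_grad=True)   # e.g., text encoder output
audio_repr = torch.randn(16, 64, requires_grad=True)  # e.g., audio encoder output
loss = infonce_mi_bound(text_repr, audio_repr)
loss.backward()
```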
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- What Makes Multimodal Learning Better than Single (Provably) [28.793128982222438]
We show that learning with multiple modalities achieves a smaller population risk than only using a subset of the modalities.
This is the first theoretical treatment to capture important qualitative phenomena observed in real multimodal applications.
arXiv Detail & Related papers (2021-06-08T17:20:02Z)
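The claim in the entry above can be written in hedged notation (the symbols are mine, not necessarily the paper's): a learner that sees a superset of modalities attains a population risk that is no larger, up to estimation terms.

```latex
% Notation is illustrative; the paper's exact statement and conditions differ.
% \mathcal{M} \subseteq \mathcal{N}: two sets of modalities;
% \hat{f}_{\mathcal{M}}: the model learned from data restricted to \mathcal{M};
% r(f) = \mathbb{E}_{(x,y)}[\ell(f(x), y)]: the population risk.
\[
  r\bigl(\hat{f}_{\mathcal{N}}\bigr) \;\le\; r\bigl(\hat{f}_{\mathcal{M}}\bigr)
  \quad \text{for } \mathcal{M} \subseteq \mathcal{N},
\]
% up to estimation terms that shrink with the sample size under the paper's
% assumptions on the quality of the learned latent representations.
```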
- Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.