Survey on Self-Supervised Multimodal Representation Learning and
Foundation Models
- URL: http://arxiv.org/abs/2211.15837v1
- Date: Tue, 29 Nov 2022 00:17:43 GMT
- Authors: Sushil Thapa
- Abstract summary: This paper summarizes some of the landmark research papers that are directly or indirectly responsible for building the foundation of today's multimodal self-supervised representation learning.
The paper traces the development of representation learning for each modality over the last few years and how these were later combined into multimodal agents.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning has been the subject of growing interest in recent years.
In particular, multimodal learning has shown great promise for solving a wide
range of problems in domains such as language, vision, and audio. One promising
research direction for further improvement is learning rich and robust
low-dimensional representations of the high-dimensional world with the help of
large-scale datasets available on the internet. Because of its potential to
avoid the cost of annotating large-scale datasets, self-supervised learning has
become the de facto standard for this task in recent years. This paper
summarizes some of the landmark research papers that are directly or indirectly
responsible for building the foundation of today's multimodal self-supervised
representation learning. The paper traces the development of representation
learning for each modality over the last few years and how these were later
combined into multimodal agents.
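The contrastive objectives behind much of this line of work (e.g., CLIP-style image-text alignment) can be sketched roughly as below. This is a minimal illustration only, not code from the survey or any paper listed here; the encoders producing the embeddings, the batch size, and the temperature value are all assumed.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays from hypothetical
    modality-specific encoders; row i of each is a matched pair.
    """
    # L2-normalize rows so the dot product is cosine similarity.
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; the diagonal holds the true pairs.
    logits = a @ b.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over rows, then pick the
        # diagonal (correct-pair) log-probabilities.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Contrast images against texts and vice versa, then average.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pulls matched image-text pairs together in the shared embedding space while pushing apart the mismatched pairs within the batch.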
Related papers
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Information extraction (IE) aims to extract structured knowledge from plain natural-language text.
Generative Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
LLMs offer viable solutions for IE tasks based on a generative paradigm.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal language models integrate multiple data types, such as images, text, audio, and other heterogeneous modalities.
This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms.
A practical guide is provided, offering insights into the technical aspects of multimodal models.
Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z) - UniDoc: A Universal Large Multimodal Model for Simultaneous Text
Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities.
To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z) - Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training works in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z) - A survey on Self Supervised learning approaches for improving Multimodal
representation learning [13.581713668241552]
This paper gives an overview of the best self-supervised learning approaches for multimodal learning.
Cross-modal generation, cross-modal pretraining, cyclic translation, and generating unimodal labels in a self-supervised fashion are discussed.
arXiv Detail & Related papers (2022-10-20T05:19:49Z) - Causal Reasoning Meets Visual Representation Learning: A Prospective
Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are emerging challenges for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z) - A Review on Methods and Applications in Multimodal Deep Learning [8.152125331009389]
Multimodal deep learning helps systems understand and analyze information better when various senses are engaged in processing it.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2022-02-18T13:50:44Z) - Deep Long-Tailed Learning: A Survey [163.16874896812885]
Deep long-tailed learning aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution.
Long-tailed class imbalance is a common problem in practical visual recognition tasks.
This paper provides a comprehensive survey on recent advances in deep long-tailed learning.
arXiv Detail & Related papers (2021-10-09T15:25:22Z) - Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.