Survey on Self-Supervised Multimodal Representation Learning and
Foundation Models
- URL: http://arxiv.org/abs/2211.15837v1
- Date: Tue, 29 Nov 2022 00:17:43 GMT
- Authors: Sushil Thapa
- Abstract summary: This paper summarizes some of the landmark research papers that are directly or indirectly responsible for building the foundation of today's multimodal self-supervised representation learning.
The paper traces the development of representation learning for each modality over the last few years and how these were later combined into multimodal agents.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning has been the subject of growing interest in recent years.
In particular, multimodal learning has shown great promise for solving a wide
range of problems in domains such as language, vision, and audio. One promising
research direction for further improvement is learning rich and robust
low-dimensional representations of the high-dimensional world with the help of
large-scale datasets available on the internet. Because of its potential to
avoid the cost of annotating large-scale datasets, self-supervised learning has
become the de facto standard for this task in recent years. This paper
summarizes some of the landmark research papers that are directly or indirectly
responsible for building the foundation of today's multimodal self-supervised
representation learning. The paper traces the development of representation
learning for each modality over the last few years and how these were later
combined into multimodal agents.
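The contrastive objectives behind much of this line of work (e.g., CLIP-style image-text alignment) can be sketched roughly as below. This is a minimal illustration only, not code from the survey or any paper listed here; the encoders producing the embeddings, the batch size, and the temperature value are all assumed.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays from hypothetical
    modality-specific encoders; row i of each is a matched pair.
    """
    # L2-normalize rows so the dot product is cosine similarity.
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; the diagonal holds the true pairs.
    logits = a @ b.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over rows, then pick the
        # diagonal (correct-pair) log-probabilities.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Contrast images against texts and vice versa, then average.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pulls matched image-text pairs together in the shared embedding space while pushing apart the mismatched pairs within the batch.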
Related papers
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Information extraction (IE) aims to extract structured knowledge from plain natural-language text.
Generative Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
LLMs offer viable solutions for IE tasks based on a generative paradigm.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal language models integrate multiple data types, such as images, text, audio, and other heterogeneous modalities.
This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms.
A practical guide is provided, offering insights into the technical aspects of multimodal models.
Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z) - UniDoc: A Universal Large Multimodal Model for Simultaneous Text
Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities.
To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z) - Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training works in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z) - A survey on Self Supervised learning approaches for improving Multimodal
representation learning [13.581713668241552]
This paper gives an overview of the best self-supervised learning approaches for multimodal learning.
Cross-modal generation, cross-modal pretraining, cyclic translation, and generating unimodal labels in a self-supervised fashion are discussed.
arXiv Detail & Related papers (2022-10-20T05:19:49Z) - Causal Reasoning Meets Visual Representation Learning: A Prospective
Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are emerging challenges for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z) - A Review on Methods and Applications in Multimodal Deep Learning [8.152125331009389]
Multimodal deep learning helps systems understand and analyze information better when various senses are engaged in processing it.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2022-02-18T13:50:44Z) - Deep Long-Tailed Learning: A Survey [163.16874896812885]
Deep long-tailed learning aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution.
Long-tailed class imbalance is a common problem in practical visual recognition tasks.
This paper provides a comprehensive survey on recent advances in deep long-tailed learning.
arXiv Detail & Related papers (2021-10-09T15:25:22Z) - Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.