New Ideas and Trends in Deep Multimodal Content Understanding: A Review
- URL: http://arxiv.org/abs/2010.08189v1
- Date: Fri, 16 Oct 2020 06:50:54 GMT
- Title: New Ideas and Trends in Deep Multimodal Content Understanding: A Review
- Authors: Wei Chen and Weiping Wang and Li Liu and Michael S. Lew
- Abstract summary: The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.
This paper will examine recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants.
- Score: 24.576001583494445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text. Unlike classic reviews of deep learning, where monomodal image classifiers such as VGG, ResNet, and Inception are the central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets, and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g., image captioning, image generation) and bi-directional (e.g., cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two challenging aspects of achieving better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial for overcoming the aforementioned challenges. Finally, we outline several promising directions for future research.
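To make the two highlighted ingredients concrete, below is a minimal PyTorch sketch (not code from the survey) of a shared image-text embedding space trained with a symmetric contrastive objective. The projection sizes, temperature, and toy inputs are illustrative assumptions.

```python
# A minimal sketch of joint feature embedding plus objective function design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        # Linear projections stand in for full CNN / text-encoder backbones.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalize so dot products are cosine similarities.
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # Symmetric InfoNCE: matched image-text pairs on the diagonal are
    # positives; all other pairs in the batch serve as negatives.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random stand-in features.
model = JointEmbedding()
img_feat, txt_feat = torch.randn(8, 2048), torch.randn(8, 768)
loss = contrastive_loss(*model(img_feat, txt_feat))
loss.backward()
```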
Related papers
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose MultiMD, a Multimedia Misinformation Detection framework that detects misinformation in video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities and can even solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants [187.72038587829223]
The research landscape encompasses five core topics, categorized into two classes.
The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities.
arXiv Detail & Related papers (2023-09-18T17:56:28Z)
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
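For context, here is a minimal sketch of the conventional BERT-style MLM masking step that such objectives build on; the paper's object divergence constraint is its own contribution and is not reproduced here, and the token IDs and mask rate are generic assumptions.

```python
# Standard MLM input corruption: mask ~15% of tokens for prediction.
import torch

MASK_ID, VOCAB_SIZE, MASK_PROB = 103, 30522, 0.15  # BERT-style conventions

def mask_tokens(token_ids):
    """Return (corrupted inputs, labels) for an MLM objective."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < MASK_PROB
    labels[~mask] = -100  # positions ignored by cross-entropy
    masked = token_ids.clone()
    # Of the chosen positions: 80% -> [MASK], 10% -> random, 10% unchanged.
    replace = mask & (torch.rand(token_ids.shape) < 0.8)
    masked[replace] = MASK_ID
    random_tok = mask & ~replace & (torch.rand(token_ids.shape) < 0.5)
    masked[random_tok] = torch.randint(VOCAB_SIZE, token_ids.shape)[random_tok]
    return masked, labels

inputs, labels = mask_tokens(torch.randint(VOCAB_SIZE, (2, 16)))
```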
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
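As a rough illustration of the triplet objective behind such tasks (not the paper's implementation), the following sketch pushes an anchor image embedding closer to a matching text embedding than to a non-matching one by a margin; the margin, embedding size, and toy tensors are assumptions, and the paper's weakly-supervised triplet construction is not reproduced here.

```python
# A minimal triplet ranking loss over L2-normalized embeddings.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Cosine distance = 1 - cosine similarity on unit-norm vectors.
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = 1 - (a * p).sum(-1)
    d_neg = 1 - (a * n).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = triplet_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```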
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval [12.30468719055037]
A Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) is developed to grasp the joint text-image representations.
The first module is a weight-sharing transformer built on top of the visual and textual encoders.
The other consists of three specially designed contrastive learning objectives that aim to share knowledge between different models.
arXiv Detail & Related papers (2022-07-02T04:08:44Z)
- M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis [28.958168542624062]
We present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis.
M2Lens provides explanations on intra- and inter-modal interactions at the global, subset, and local levels.
arXiv Detail & Related papers (2021-07-17T15:54:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.