WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model
- URL: http://arxiv.org/abs/2110.14378v1
- Date: Wed, 27 Oct 2021 12:25:21 GMT
- Title: WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model
- Authors: Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen,
Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Hao Sun and Ji-Rong Wen
- Abstract summary: We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
- Score: 74.4875156387271
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fundamental goal of artificial intelligence (AI) is to mimic the core
cognitive activities of humans, including perception, memory, and reasoning.
Although tremendous success has been achieved in various AI research fields
(e.g., computer vision and natural language processing), the majority of
existing works focus only on acquiring a single cognitive ability (e.g., image
classification, reading comprehension, or visual commonsense reasoning). To
overcome this limitation and take a solid step toward artificial general
intelligence (AGI), we develop a novel foundation model pre-trained with huge
multimodal (visual and textual) data, which can be quickly adapted to a broad
class of downstream cognitive tasks. Such a model is fundamentally different
from the multimodal foundation models recently proposed in the literature,
which typically make a strong semantic correlation assumption and expect exact
alignment between the image and text modalities in their pre-training data;
this assumption is often hard to satisfy in practice and thus limits their
generalization abilities. To resolve this issue, we propose to pre-train our
foundation model by self-supervised learning on weak semantic correlation data
crawled from the Internet, and show that state-of-the-art results can be
obtained on a wide range of downstream tasks (both single-modal and
cross-modal). In particular, with the novel model-interpretability tools
developed in this work, we demonstrate that our foundation model now possesses
strong imagination ability (even with hints of commonsense). We believe our
work makes a transformative stride towards AGI and will have broad impact on
various AI+ fields (e.g., neuroscience and healthcare).
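
The pre-training recipe described in the abstract is cross-modal and self-supervised: image and text encoders are trained so that weakly correlated image-text pairs crawled from the Web score higher than mismatched pairs. The sketch below is a minimal illustration of a symmetric image-text contrastive (InfoNCE-style) objective, which is how this kind of weakly correlated pre-training is commonly formulated; it is not the authors' implementation, and the encoder architectures, momentum encoders, and negative-sample queues used by the actual model are omitted. All function names and dimensions are illustrative.

```python
# Minimal sketch (assumed formulation, not the authors' released code) of a
# symmetric cross-modal InfoNCE loss: matched image-text pairs from weakly
# correlated web data are pulled together while in-batch mismatches act as
# negatives.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (image, text) embedding pairs."""
    img = F.normalize(img_emb, dim=-1)               # (B, D) unit-norm image embeddings
    txt = F.normalize(txt_emb, dim=-1)               # (B, D) unit-norm text embeddings
    logits = img @ txt.t() / temperature             # (B, B) scaled cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs lie on the diagonal; every other entry is a negative.
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Random tensors stand in for the outputs of the image and text encoders.
    loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(f"contrastive loss: {loss.item():.4f}")
```

Under the weak semantic correlation setting, what changes is mainly the data (loosely paired web images and posts rather than exact captions), not the loss itself, which is why a standard contrastive objective of this form still applies.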
Related papers
- ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers.
ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution.
We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z)
- Big Cooperative Learning [7.958840888809145]
We show that the training of foundation models can be interpreted as a form of big cooperative learning.
We propose the BigLearn-GAN, which is a novel adversarially-trained foundation model with versatile data sampling capabilities.
arXiv Detail & Related papers (2024-07-31T03:59:14Z)
- A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Vision-language-action models (VLAs) have become a foundational element in robot learning.
Various methods have been proposed to enhance traits such as versatility, dexterity, and generalizability.
VLAs serve as high-level task planners capable of decomposing long-horizon tasks into executable subtasks.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
- Data Science Principles for Interpretable and Explainable AI [0.7581664835990121]
Interpretable and interactive machine learning aims to make complex models more transparent and controllable.
This review synthesizes key principles from the growing literature in this field.
arXiv Detail & Related papers (2024-05-17T05:32:27Z)
- Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions.
In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model.
arXiv Detail & Related papers (2024-02-28T16:09:56Z)
- Imaginations of WALL-E: Reconstructing Experiences with an Imagination-Inspired Module for Advanced AI Systems [2.452498006404167]
Our system is equipped with an imagination-inspired module that bridges the gap between textual inputs and other modalities.
This leads to unique interpretations of a concept that may differ from human interpretations but are equally valid.
This work represents a significant advancement in the development of imagination-inspired AI systems.
arXiv Detail & Related papers (2023-08-20T20:10:55Z)
- Abstract Visual Reasoning Enabled by Language [8.627180519837657]
We propose a general learning-based framework for solving ARC.
It is centered on transforming tasks from the vision to the language domain.
This composition of language and vision allows for pre-trained models to be leveraged at each stage.
arXiv Detail & Related papers (2023-03-07T17:52:46Z)
- Multimodal foundation models are better simulators of the human brain [65.10501322822881]
We present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs.
We find that both visual and lingual encoders trained multimodally are more brain-like compared with unimodal ones.
arXiv Detail & Related papers (2022-08-17T12:36:26Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.