Meta-Transformer: A Unified Framework for Multimodal Learning
- URL: http://arxiv.org/abs/2307.10802v1
- Date: Thu, 20 Jul 2023 12:10:29 GMT
- Title: Meta-Transformer: A Unified Framework for Multimodal Learning
- Authors: Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao,
Wanli Ouyang, Xiangyu Yue
- Abstract summary: Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a $\textbf{frozen}$ encoder to perform multimodal perception.
- Score: 105.77219833997962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning aims to build models that can process and relate
information from multiple modalities. Despite years of development in this
field, it still remains challenging to design a unified network for processing
various modalities ($\textit{e.g.}$ natural language, 2D images, 3D point
clouds, audio, video, time series, tabular data) due to the inherent gaps among
them. In this work, we propose a framework, named Meta-Transformer, that
leverages a $\textbf{frozen}$ encoder to perform multimodal perception without
any paired multimodal training data. In Meta-Transformer, the raw input data
from various modalities are mapped into a shared token space, allowing a
subsequent encoder with frozen parameters to extract high-level semantic
features of the input data. Composed of three main components: a unified data
tokenizer, a modality-shared encoder, and task-specific heads for downstream
tasks, Meta-Transformer is the first framework to perform unified learning
across 12 modalities with unpaired data. Experiments on different benchmarks
reveal that Meta-Transformer can handle a wide range of tasks including
fundamental perception (text, image, point cloud, audio, video), practical
application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph,
tabular, and time-series). Meta-Transformer indicates a promising future for
developing unified multimodal intelligence with transformers. Code will be
available at https://github.com/invictus717/MetaTransformer
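The recipe described in the abstract, per-modality tokenizers that map raw inputs into a shared token space, a frozen modality-shared Transformer encoder, and lightweight task-specific heads, can be sketched in a few lines of PyTorch. Everything below (class names, tokenizer designs, hyperparameters) is an illustrative assumption rather than the paper's released code; only the overall wiring (shared token space, frozen encoder, trainable tokenizers and heads) follows the abstract.

```python
# Minimal sketch of the Meta-Transformer recipe: modality-specific tokenizers map
# raw inputs into a shared token space, a *frozen* Transformer encoder extracts
# features, and only the tokenizers plus a task-specific head are trained.
# All names and hyperparameters here are illustrative, not from the paper's code.
import torch
import torch.nn as nn


class ImageTokenizer(nn.Module):
    """ViT-style patch embedding: (B, 3, H, W) -> (B, N, D)."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)


class SeriesTokenizer(nn.Module):
    """Toy tokenizer for time series / IMU data: (B, T, C) -> (B, T, D)."""
    def __init__(self, in_ch, dim=768):
        super().__init__()
        self.proj = nn.Linear(in_ch, dim)

    def forward(self, x):
        return self.proj(x)


class MetaTransformerSketch(nn.Module):
    def __init__(self, tokenizers: dict, num_classes: int, dim=768):
        super().__init__()
        self.tokenizers = nn.ModuleDict(tokenizers)           # trainable, per modality
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.encoder.requires_grad_(False)                     # frozen shared encoder
        self.head = nn.Linear(dim, num_classes)                # task-specific head

    def forward(self, x, modality: str):
        tokens = self.tokenizers[modality](x)                  # shared token space
        feats = self.encoder(tokens)                           # frozen feature extraction
        return self.head(feats.mean(dim=1))                    # pooled classification


model = MetaTransformerSketch(
    {"image": ImageTokenizer(), "imu": SeriesTokenizer(in_ch=6)}, num_classes=10
)
logits = model(torch.randn(2, 3, 224, 224), modality="image")
print(logits.shape)  # torch.Size([2, 10])
```

In this sketch the encoder is randomly initialized for brevity; the point of the frozen-encoder design is that a single pretrained backbone can serve every modality, with only the cheap tokenizers and heads updated per task.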
Related papers
- Exchanging-based Multimodal Fusion with Transformer [19.398692598523454]
We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods, proposed for vision-vision fusion, aim to exchange the embeddings learned from one modality with those of the other.
We propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer.
arXiv Detail & Related papers (2023-09-05T12:48:25Z)
- Transformer-Based Visual Segmentation: A Survey [118.01564082499948]
Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
arXiv Detail & Related papers (2023-04-19T17:59:02Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact (a minimal sketch of the substitution step follows this list).
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
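The token-substitution idea summarized for TokenFusion above can be sketched as follows: score each token of one modality with a small gating network, treat low-scoring tokens as uninformative, and overwrite them with a linear projection of the corresponding token from the other modality. The scoring MLP, threshold, and projection layer are assumptions made for illustration, not the paper's exact design.

```python
# Hedged sketch of the token-substitution step described for TokenFusion:
# low-importance tokens in one modality are replaced by projected tokens from
# the other modality. The scorer, threshold, and projection are illustrative
# assumptions, not the published architecture.
import torch
import torch.nn as nn


class TokenSubstitution(nn.Module):
    def __init__(self, dim=256, threshold=0.5):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)   # projects tokens from the other modality

    def forward(self, tokens_a, tokens_b):
        """tokens_a, tokens_b: (B, N, D) token sequences from two aligned modalities."""
        importance = self.score(tokens_a)              # (B, N, 1), in [0, 1]
        substitute = self.proj(tokens_b)               # inter-modal replacement tokens
        keep = (importance > 0.5).float()              # 1 = keep original, 0 = substitute
        return keep * tokens_a + (1 - keep) * substitute


fuse = TokenSubstitution()
a, b = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
fused = fuse(a, b)
print(fused.shape)  # torch.Size([2, 196, 256])
```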
This list is automatically generated from the titles and abstracts of the papers in this site.