BM-NAS: Bilevel Multimodal Neural Architecture Search
- URL: http://arxiv.org/abs/2104.09379v1
- Date: Mon, 19 Apr 2021 15:09:49 GMT
- Title: BM-NAS: Bilevel Multimodal Neural Architecture Search
- Authors: Yihang Yin, Siyu Huang, Xiang Zhang, Dejing Dou
- Abstract summary: This paper proposes the Bilevel Multimodal Neural Architecture Search (BM-NAS) framework.
It makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme.
BM-NAS achieves competitive performance with much less search time and fewer model parameters.
- Score: 30.472605201814428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have shown superior performance on various
multimodal learning problems. However, adapting DNNs to individual multimodal
tasks often requires substantial effort in manually engineering unimodal features
and designing multimodal feature fusion strategies. This paper proposes the Bilevel
Multimodal Neural Architecture Search (BM-NAS) framework, which makes the
architecture of multimodal fusion models fully searchable via a bilevel
searching scheme. At the upper level, BM-NAS selects the inter/intra-modal
feature pairs from the pretrained unimodal backbones. At the lower level,
BM-NAS learns the fusion strategy for each feature pair, which is a combination
of predefined primitive operations. The primitive operations are carefully
designed and can be flexibly combined to express various effective
feature fusion modules such as multi-head attention (Transformer) and Attention
on Attention (AoA). Experimental results on three multimodal tasks demonstrate
the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS
achieves competitive performance with much less search time and fewer model
parameters than existing generalized multimodal NAS methods.
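To make the bilevel scheme concrete, here is a minimal, hedged sketch in PyTorch of one searchable fusion step, assuming a DARTS-style continuous relaxation: an upper-level weight vector (alpha) softly selects an inter/intra-modal feature pair, and a lower-level weight matrix (beta) softly selects the primitive operation applied to that pair. The class names, the three-primitive set, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the bilevel search idea described in the abstract
# (illustrative only; primitives, names, and shapes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SumFuse(nn.Module):
    def forward(self, x, y):
        return x + y


class LinearFuse(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)

    def forward(self, x, y):
        return self.proj(torch.cat([x, y], dim=-1))


class AttnFuse(nn.Module):
    """Single-head scaled dot-product attention, standing in for the
    multi-head attention / AoA primitives mentioned in the abstract."""
    def __init__(self, d):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, x, y):
        scores = self.q(x) @ self.k(y).transpose(-2, -1) / (x.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ self.v(y)


class BilevelFusionCell(nn.Module):
    """Upper level: alpha softly selects an inter/intra-modal feature pair.
    Lower level: beta softly selects the primitive used to fuse that pair."""
    def __init__(self, n_feats, d):
        super().__init__()
        # All inter-modal (i != j) and intra-modal (i == j) pairs.
        self.pairs = [(i, j) for i in range(n_feats) for j in range(i, n_feats)]
        self.ops = nn.ModuleList([
            nn.ModuleList([SumFuse(), LinearFuse(d), AttnFuse(d)])
            for _ in self.pairs
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.pairs)))    # pair selection
        self.beta = nn.Parameter(torch.zeros(len(self.pairs), 3))  # op selection

    def forward(self, feats):  # feats: list of (B, T, d) unimodal backbone features
        w_pair = F.softmax(self.alpha, dim=0)
        w_op = F.softmax(self.beta, dim=-1)
        out = 0
        for p, (i, j) in enumerate(self.pairs):
            fused = sum(w_op[p, k] * op(feats[i], feats[j])
                        for k, op in enumerate(self.ops[p]))
            out = out + w_pair[p] * fused
        return out
```

In a full search, alpha/beta and the network weights would typically be optimized alternately, and the final architecture discretized by taking the argmax of each softmax, in the usual differentiable-NAS fashion; the paper's actual search procedure may differ.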
Related papers
- Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models [7.134682404460003]
We introduce a novel training framework, Alt-MoE, which employs the Mixture of Experts (MoE) as a unified multi-directional connector across modalities.
Our methodology has been validated on several well-performing uni-modal models.
arXiv Detail & Related papers (2024-09-09T10:40:50Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- An Evolutionary Network Architecture Search Framework with Adaptive Multimodal Fusion for Hand Gesture Recognition [5.001653808609435]
We propose an evolutionary network architecture search framework with adaptive multimodal fusion (AMF-ENAS).
AMF-ENAS achieves state-of-the-art performance on the Ninapro DB2, DB3, and DB7 datasets.
arXiv Detail & Related papers (2024-03-27T02:39:23Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even showing emergent capability on tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Harmonic-NAS: Hardware-Aware Multimodal Neural Architecture Search on Resource-constrained Devices [0.4915744683251151]
We propose a framework for the joint optimization of unimodal backbones and multimodal fusion networks with hardware awareness on resource-constrained devices.
Harmonic-NAS achieves 10.9% accuracy improvement, 1.91x latency reduction, and 2.14x energy efficiency gain.
arXiv Detail & Related papers (2023-09-12T21:37:26Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks.
arXiv Detail & Related papers (2020-04-25T07:00:32Z)
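As a rough illustration of the MMnas layout summarized in the last entry above (a shared encoder-decoder backbone with task-specific heads), the sketch below fixes the backbone to standard Transformer layers, whereas MMnas actually searches the backbone's operations from its primitive set; the names UnifiedBackbone and MultiTaskModel, the dimensions, and the pooling choice are assumptions.

```python
# Hedged sketch of a unified encoder-decoder backbone with task-specific heads
# (illustrative only; MMnas searches the backbone's operations, not shown here).
import torch
import torch.nn as nn


class UnifiedBackbone(nn.Module):
    """Generic encoder-decoder stand-in for the searched unified backbone."""
    def __init__(self, d=512, n_layers=4, n_heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, img_feats, txt_feats):
        memory = self.encoder(img_feats)        # encode one modality
        return self.decoder(txt_feats, memory)  # decode the other against it


class MultiTaskModel(nn.Module):
    def __init__(self, d=512, task_dims=None):
        super().__init__()
        self.backbone = UnifiedBackbone(d)
        # One lightweight head per task on top of the shared backbone.
        self.heads = nn.ModuleDict({
            task: nn.Linear(d, out_dim) for task, out_dim in (task_dims or {}).items()
        })

    def forward(self, img_feats, txt_feats, task):
        h = self.backbone(img_feats, txt_feats).mean(dim=1)  # pooled joint feature
        return self.heads[task](h)
```

A hypothetical usage would be model = MultiTaskModel(task_dims={'vqa': 3129, 'matching': 2}) followed by model(img_feats, txt_feats, task='vqa'): the backbone is shared across tasks while each head remains task-specific.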
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.