UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
- URL: http://arxiv.org/abs/2502.06474v1
- Date: Mon, 10 Feb 2025 13:52:52 GMT
- Title: UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
- Authors: Weijia Mao, Zhenheng Yang, Mike Zheng Shou
- Abstract summary: UniMoD is a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned.
We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks.
- Score: 17.68867710994329
- License:
- Abstract: Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. Past studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important tokens for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers results in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze unified transformers by (1) examining attention weight patterns, (2) evaluating layer importance and token redundancy, and (3) analyzing task interactions. Our findings reveal that token redundancy is primarily influenced by different tasks and layers. Building on these findings, we introduce UniMoD, a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks. Code will be released at https://github.com/showlab/UniMoD.
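The routing mechanism described in the abstract lends itself to a compact illustration. The sketch below is a hypothetical PyTorch rendering of task-aware Mixture-of-Depths routing, not the released UniMoD code: the class name `TaskAwareMoDLayer`, the per-task keep ratios, and the use of a stock `nn.TransformerEncoderLayer` are all assumptions made for illustration.

```python
# Hypothetical sketch of task-aware MoD routing (not the official UniMoD code).
# Each task owns a router; at a layer, only that task's top-scoring tokens are
# processed by the block, while the remaining tokens skip it unchanged.
import torch
import torch.nn as nn


class TaskAwareMoDLayer(nn.Module):
    def __init__(self, dim, num_heads, keep_ratio_per_task):
        super().__init__()
        # e.g. keep_ratio_per_task = {"understanding": 0.9, "generation": 0.6}
        # (assumed values; the paper finds redundancy varies by task and layer)
        self.keep_ratio = keep_ratio_per_task
        self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.routers = nn.ModuleDict(
            {task: nn.Linear(dim, 1) for task in keep_ratio_per_task}
        )

    def forward(self, x, task):
        # x: (batch, seq_len, dim); task: name of the task this batch belongs to
        scores = self.routers[task](x).squeeze(-1)              # (batch, seq_len)
        k = max(1, int(self.keep_ratio[task] * x.size(1)))
        top = scores.topk(k, dim=-1).indices                    # tokens to keep
        idx = top.unsqueeze(-1).expand(-1, -1, x.size(-1))      # (batch, k, dim)
        selected = x.gather(1, idx)
        # Scale the block's update by the router score so the router is trained.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        updated = selected + gate * (self.block(selected) - selected)
        # Pruned tokens bypass the layer; selected tokens are written back.
        return x.scatter(1, idx, updated)
```

Stacking such layers with keep ratios that differ by task and by depth mirrors the paper's observation that token redundancy depends on both the task and the layer.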
Related papers
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformers aims to enhance model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning.
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z)
- How Redundant Is the Transformer Stack in Speech Representation Models? [1.2699529713351287]
Self-supervised speech representation models have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection.
Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning.
We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training.
arXiv Detail & Related papers (2024-09-10T11:00:24Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM)
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy; a rough sketch of this idea follows this entry.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
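For the mixture-of-modules entry above, the following is a loose sketch of the general intuition only: a shared pool of modules and a router deciding how each token is computed at every step. The pool contents, the soft routing, and the step count are assumptions for illustration, not the paper's actual design.

```python
# Loose sketch of the mixture-of-modules intuition: any module from a shared
# pool can be used, at any position in the stack, to compute a token.
import torch
import torch.nn as nn


class MixtureOfModules(nn.Module):
    def __init__(self, dim, pool_size=4, steps=6):
        super().__init__()
        self.steps = steps
        # Shared pool of modules, reusable at every step of the computation.
        self.pool = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(pool_size)]
        )
        self.router = nn.Linear(dim, pool_size)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        for _ in range(self.steps):
            weights = torch.softmax(self.router(x), dim=-1)              # (B, S, P)
            candidates = torch.stack([m(x) for m in self.pool], dim=-1)  # (B, S, D, P)
            # Each token's update is a router-weighted mix over the module pool.
            x = x + (candidates * weights.unsqueeze(2)).sum(dim=-1)
        return x
```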
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
- AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoE's under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop of accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [24.870827400461682]
We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task.
The entire model is jointly trained end-to-end with losses from each task; a minimal sketch of this pattern follows the entry.
arXiv Detail & Related papers (2021-02-22T04:45:06Z)
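As referenced above, here is a minimal sketch of the shared-transformer pattern the UniT entry describes: per-modality encoders, a shared trunk, task-specific heads, and a joint loss summed over tasks. The toy encoders, vocabulary size, task heads, and pooling are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the pattern in the UniT entry: encode each modality, share a
# transformer trunk, and predict with task-specific heads trained jointly.
import torch
import torch.nn as nn


class UnifiedMultitaskModel(nn.Module):
    def __init__(self, dim=256, num_classes_per_task=None):
        super().__init__()
        num_classes_per_task = num_classes_per_task or {"vqa": 10, "sentiment": 2}
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.text_encoder = nn.EmbeddingBag(30522, dim)           # toy text encoder
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)   # shared parameters
        self.heads = nn.ModuleDict(
            {task: nn.Linear(dim, n) for task, n in num_classes_per_task.items()}
        )

    def forward(self, image, text_ids, task):
        tokens = torch.stack(
            [self.image_encoder(image), self.text_encoder(text_ids)], dim=1
        )                                                         # (batch, 2, dim)
        fused = self.trunk(tokens).mean(dim=1)                    # pool modalities
        return self.heads[task](fused)


# Joint end-to-end training sums the per-task losses, e.g.
#   loss = sum(criterion(model(img, txt, t), labels[t]) for t in active_tasks)
```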
This list is automatically generated from the titles and abstracts of the papers on this site.