Mixture of Nested Experts: Adaptive Processing of Visual Tokens
- URL: http://arxiv.org/abs/2407.19985v2
- Date: Tue, 30 Jul 2024 17:26:22 GMT
- Title: Mixture of Nested Experts: Adaptive Processing of Visual Tokens
- Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
- Abstract summary: Vision Transformer (ViT) based models fail to capitalize on the inherent redundancy of visual data, leading to higher computational costs.
We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve.
We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2.
- Score: 49.43920770789789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining the same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve performance equivalent to the baseline models, while reducing inference-time compute by over two-fold. We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
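The routing idea described in the abstract can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch illustration (not the authors' released code): nested experts are prefix slices of one shared feed-forward block, a linear router scores token priority, and a fixed capacity split sends the top-priority tokens to the largest expert while redundant tokens fall through to cheaper slices. All names, dimensions, and the capacity fractions are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NestedExpertFFN(nn.Module):
    """One shared FFN; nested expert k uses only the first hidden_dims[k] channels."""

    def __init__(self, dim=256, hidden=1024, hidden_dims=(256, 512, 1024)):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.hidden_dims = hidden_dims  # increasing compute -> increasing capacity

    def forward_expert(self, x, k):
        h = self.hidden_dims[k]
        # A nested expert is a prefix slice of the full expert's weights,
        # so cheaper experts reuse (rather than duplicate) parameters.
        z = F.gelu(F.linear(x, self.fc1.weight[:h], self.fc1.bias[:h]))
        return F.linear(z, self.fc2.weight[:, :h], self.fc2.bias)


class MoNEBlockSketch(nn.Module):
    """Route tokens to nested experts by priority under a fixed capacity split."""

    def __init__(self, dim=256, capacities=(0.5, 0.3, 0.2)):
        super().__init__()
        self.router = nn.Linear(dim, 1)  # per-token priority score
        self.ffn = NestedExpertFFN(dim)
        # Assumed fraction of tokens per expert (smallest expert first);
        # in practice this split would be chosen by the compute budget.
        self.capacities = capacities

    def forward(self, x):  # x: (batch, tokens, dim)
        b, n, d = x.shape
        scores = self.router(x).squeeze(-1)               # (batch, tokens)
        order = scores.argsort(dim=-1, descending=True)   # priority order
        counts = [int(round(f * n)) for f in self.capacities]
        counts[0] = n - sum(counts[1:])  # absorb rounding in the smallest expert
        out = torch.zeros_like(x)
        start = 0
        # The largest expert processes the highest-priority tokens; redundant
        # (low-priority) tokens are handled by the cheaper nested experts.
        for k in reversed(range(len(counts))):
            c = counts[k]
            idx = order[:, start:start + c].unsqueeze(-1).expand(-1, -1, d)
            out.scatter_(1, idx, self.ffn.forward_expert(torch.gather(x, 1, idx), k))
            start += c
        return out
```

A forward pass on `torch.randn(2, 196, 256)` returns a tensor of the same shape while only a fraction of tokens pay for the full-width FFN. The paper's actual method differs from this sketch in several respects (its learned router, the nesting of the full ViT block rather than just the FFN, and how capacities are allocated per layer), so treat this only as an illustration of the nested-expert routing idea.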
Related papers
- M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension [36.01063804442098]
Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression.
Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters.
We present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs (Mixture of Multi-Modal Interactive Side-Adapters).
arXiv Detail & Related papers (2024-07-01T09:53:53Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- UniMatch: A Unified User-Item Matching Framework for the Multi-purpose Merchant Marketing [27.459774494479227]
We present a unified user-item matching framework to simultaneously conduct item recommendation and user targeting with just one model.
Our framework results in significant performance gains in comparison with the state-of-the-art methods, with greatly reduced cost on computing resources and daily maintenance.
arXiv Detail & Related papers (2023-07-19T13:49:35Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be applied to vanilla Transformers with both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- VA-RED$^2$: Video Adaptive Redundancy Reduction [64.75692128294175]
We present a redundancy reduction framework, VA-RED$^2$, which is input-dependent.
We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism.
Our framework achieves a 20%-40% reduction in computation (FLOPs) compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-15T22:57:52Z)