AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
- URL: http://arxiv.org/abs/2312.06709v5
- Date: Tue, 30 Apr 2024 22:22:03 GMT
- Title: AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
- Authors: Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov
- Abstract summary: AM-RADIO (Agglomerative Model -- Reduce All Domains Into One) merges visual foundation models such as CLIP, DINOv2, and SAM into a unified model through multi-teacher distillation.
We develop a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models.
Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework.
- Score: 47.58919672657824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework. Code: https://github.com/NVlabs/RADIO
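The multi-teacher distillation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the loss, so the cosine-distance objective, the per-teacher linear heads, and every name below (`multi_teacher_loss`, `heads`, `teacher_feats`) are assumptions made for illustration only. The sketch uses plain Python lists so it runs without any dependencies.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def linear(weight_rows, x):
    """Apply a bias-free linear map given as a list of weight rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def multi_teacher_loss(student_feat, heads, teacher_feats, teacher_weights=None):
    """Sum of weighted feature-matching losses, one term per teacher.

    student_feat:    shared student feature vector
    heads:           hypothetical per-teacher projection matrices
    teacher_feats:   target feature vector produced by each frozen teacher
    teacher_weights: optional per-teacher loss weights (default 1.0)
    """
    teacher_weights = teacher_weights or {}
    total = 0.0
    for name, target in teacher_feats.items():
        # Project the shared student feature into this teacher's space.
        prediction = linear(heads[name], student_feat)
        total += teacher_weights.get(name, 1.0) * cosine_distance(prediction, target)
    return total


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    loss = multi_teacher_loss(
        student_feat=[1.0, 0.0],
        heads={"teacher_a": identity, "teacher_b": identity},
        teacher_feats={"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 1.0]},
    )
    print(f"combined distillation loss: {loss:.4f}")
```

In a real training loop, the student backbone and the heads would be updated by gradient descent on this combined loss while the teachers stay frozen; the per-teacher weights are one possible way to balance teachers whose features differ in scale.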
Related papers
- Leveraging Foundation Models via Knowledge Distillation in Multi-Object Tracking: Distilling DINOv2 Features to FairMOT [0.5999777817331317]
This work tries to leverage one such foundation model, DINOv2, through knowledge distillation.
The results imply that although the proposed method shows improvements in certain scenarios, it does not consistently outperform the original FairMOT model.
arXiv Detail & Related papers (2024-07-25T14:21:35Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- Gramian Attention Heads are Strong yet Efficient Vision Learners [26.79263390835444]
We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (i.e., classification heads).
Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead.
Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K.
arXiv Detail & Related papers (2023-10-25T09:08:58Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Progressive Volume Distillation with Active Learning for Efficient NeRF Architecture Conversion [27.389511043400635]
Neural Radiance Fields (NeRF) have been widely adopted as practical and versatile representations for 3D scenes.
We propose Progressive Volume Distillation with Active Learning (PVD-AL), a systematic distillation method.
PVD-AL decomposes each structure into two parts and progressively performs distillation from shallower to deeper volume representation.
arXiv Detail & Related papers (2023-04-08T13:59:18Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- Universal Representation Learning from Multiple Domains for Few-shot Classification [41.821234589075445]
We propose to learn a single set of universal deep representations by distilling knowledge of multiple separately trained networks.
We show that the universal representations can be further refined for previously unseen domains by an efficient adaptation step.
arXiv Detail & Related papers (2021-03-25T13:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.