Omnivore: A Single Model for Many Visual Modalities
- URL: http://arxiv.org/abs/2201.08377v1
- Date: Thu, 20 Jan 2022 18:58:03 GMT
- Title: Omnivore: A Single Model for Many Visual Modalities
- Authors: Rohit Girdhar and Mannat Singh and Nikhila Ravi and Laurens van der
Maaten and Armand Joulin and Ishan Misra
- Abstract summary: Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
We propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters.
- Score: 47.94002558594031
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prior work has studied different visual modalities in isolation and developed
separate architectures for recognition of images, videos, and 3D data. Instead,
in this paper, we propose a single model which excels at classifying images,
videos, and single-view 3D data using exactly the same model parameters. Our
'Omnivore' model leverages the flexibility of transformer-based architectures
and is trained jointly on classification tasks from different modalities.
Omnivore is simple to train, uses off-the-shelf standard datasets, and performs
at par or better than modality-specific models of the same size. A single
Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN
RGB-D. After finetuning, our models outperform prior work on a variety of
vision tasks and generalize across modalities. Omnivore's shared visual
representation naturally enables cross-modal recognition without access to
correspondences between modalities. We hope our results motivate researchers to
model visual modalities together.
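The abstract describes the core recipe at a high level: one transformer backbone with shared parameters, trained jointly on standard image, video, and single-view 3D classification datasets. The sketch below illustrates that setup in PyTorch as a minimal illustration under stated assumptions, not the authors' code: Omnivore itself uses a Swin Transformer backbone, while the generic encoder, the `PatchEmbed3D` module, the head names, and the tiny random batches here are placeholders chosen for brevity.

```python
# Minimal sketch of joint multi-modal classification training in the spirit of
# Omnivore. The real model uses a Swin Transformer and off-the-shelf datasets
# (ImageNet, Kinetics, SUN RGB-D); everything here is an illustrative stand-in.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Turns an image, video clip, or RGB-D frame into a shared token sequence.

    Images are treated as single-frame videos; the depth channel of RGB-D
    inputs gets its own patch projection whose output is added to the RGB one.
    """
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.rgb_proj = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                                  stride=(1, patch, patch))
        self.depth_proj = nn.Conv3d(1, dim, kernel_size=(1, patch, patch),
                                    stride=(1, patch, patch))

    def forward(self, x):
        # x: (B, C, T, H, W) with C == 3 (RGB) or 4 (RGB + depth)
        rgb, depth = x[:, :3], x[:, 3:]
        tokens = self.rgb_proj(rgb)
        if depth.shape[1] == 1:
            tokens = tokens + self.depth_proj(depth)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, dim)

class OmnivoreLikeModel(nn.Module):
    def __init__(self, dim=256, num_classes_per_dataset=None):
        super().__init__()
        self.embed = PatchEmbed3D(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One linear head per dataset; the backbone weights are fully shared.
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, n)
            for name, n in num_classes_per_dataset.items()
        })

    def forward(self, x, dataset):
        tokens = self.encoder(self.embed(x))
        return self.heads[dataset](tokens.mean(dim=1))  # mean-pool, classify

model = OmnivoreLikeModel(num_classes_per_dataset={"image": 1000,
                                                   "video": 400,
                                                   "rgbd": 19})
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Joint training loop: alternate mini-batches from the three modalities.
fake_batches = [
    ("image", torch.randn(2, 3, 1, 64, 64), torch.randint(0, 1000, (2,))),
    ("video", torch.randn(2, 3, 8, 64, 64), torch.randint(0, 400, (2,))),
    ("rgbd",  torch.randn(2, 4, 1, 64, 64), torch.randint(0, 19, (2,))),
]
for dataset, x, y in fake_batches:
    loss = loss_fn(model(x, dataset), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The design point the abstract emphasizes, joint training with exactly the same model parameters, appears here as a single encoder updated by gradients from all three modalities, with only the patch projections and linear heads being modality- or dataset-specific.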
Related papers
- SUM: Saliency Unification through Mamba for Visual Attention Modeling [5.274826387442202]
Visual attention modeling plays a significant role in applications such as marketing, multimedia, and robotics.
Traditional saliency prediction models, especially those based on CNNs or Transformers, achieve notable success by leveraging large-scale annotated datasets.
In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net.
arXiv Detail & Related papers (2024-06-25T05:54:07Z)
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single-model tracker that remains blind to any modality X at inference time.
Our training process is extremely simple, integrating a multi-label classification loss with a routing function.
Our generalist and blind tracker achieves competitive performance compared to well-established modality-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on large multimodal models (LMMs), exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z)
- NViST: In the Wild New View Synthesis from a Single Image with Transformers [8.361847255300846]
We propose NViST, a transformer-based model for efficient novel-view synthesis from a single image.
NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos.
We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures.
arXiv Detail & Related papers (2023-12-13T23:41:17Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z)
- OmniMAE: Single Model Masked Pretraining on Images and Videos [40.985481596672265]
Masked autoencoding can be used to train a simple Vision Transformer on images and videos.
We show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark.
arXiv Detail & Related papers (2022-06-16T17:57:01Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? [36.67937514793215]
Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task.
arXiv Detail & Related papers (2022-02-18T15:44:14Z)
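Since this last entry hinges on the difference between cross-attention and self-attention fusion, here is a minimal PyTorch sketch of the two variants for a 7-class emotion task. The module names, dimensions, and the audio/text toy inputs are assumptions for illustration, not the paper's implementation.

```python
# Sketch contrasting cross-attention and self-attention fusion of two
# modalities (e.g. audio and text features) for 7-class emotion recognition.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Each modality attends to the other; the attended streams are pooled."""
    def __init__(self, dim=128, heads=4, num_classes=7):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, a, b):                      # a, b: (B, T, dim)
        a_ctx, _ = self.a_to_b(query=a, key=b, value=b)
        b_ctx, _ = self.b_to_a(query=b, key=a, value=a)
        fused = torch.cat([a_ctx.mean(1), b_ctx.mean(1)], dim=-1)
        return self.classifier(fused)

class SelfAttentionFusion(nn.Module):
    """Both modalities are concatenated along time and attend jointly."""
    def __init__(self, dim=128, heads=4, num_classes=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, a, b):
        x = torch.cat([a, b], dim=1)              # (B, T_a + T_b, dim)
        ctx, _ = self.attn(query=x, key=x, value=x)
        return self.classifier(ctx.mean(1))

audio = torch.randn(2, 20, 128)   # e.g. 20 audio frames per clip
text = torch.randn(2, 12, 128)    # e.g. 12 token embeddings per utterance
print(CrossAttentionFusion()(audio, text).shape)  # torch.Size([2, 7])
print(SelfAttentionFusion()(audio, text).shape)   # torch.Size([2, 7])
```

The structural difference is that cross-attention keeps two streams that query each other, whereas self-attention concatenates the modalities and lets a single attention layer mix them.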
This list is automatically generated from the titles and abstracts of the papers on this site.