Sequential Modeling Enables Scalable Learning for Large Vision Models
- URL: http://arxiv.org/abs/2312.00785v1
- Date: Fri, 1 Dec 2023 18:59:57 GMT
- Title: Sequential Modeling Enables Scalable Learning for Large Vision Models
- Authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille,
Trevor Darrell, Jitendra Malik, Alexei A. Efros
- Abstract summary: We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
- Score: 120.91839619284431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel sequential modeling approach which enables learning a
Large Vision Model (LVM) without making use of any linguistic data. To do this,
we define a common format, "visual sentences", in which we can represent raw
images and videos as well as annotated data sources such as semantic
segmentations and depth reconstructions without needing any meta-knowledge
beyond the pixels. Once this wide variety of visual data (comprising 420
billion tokens) is represented as sequences, the model can be trained to
minimize a cross-entropy loss for next token prediction. By training across
various scales of model architecture and data diversity, we provide empirical
evidence that our models scale effectively. Many different vision tasks can be
solved by designing suitable visual prompts at test time.
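The training recipe the abstract describes is plain next-token prediction over discretized visual data. Below is a minimal toy sketch of that recipe, not the authors' released implementation: it assumes a frozen VQGAN-style tokenizer has already mapped each image to a fixed grid of discrete codes, and `TinyLVM`, `next_token_loss`, and all sizes are illustrative placeholders rather than the paper's architecture or hyperparameters.

```python
# Minimal sketch (PyTorch) of training on "visual sentences". A frozen
# VQGAN-style tokenizer is assumed to have already mapped each image to
# TOKENS_PER_IMAGE discrete codes; the model and sizes below are toy
# placeholders, not the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192        # assumed codebook size of the visual tokenizer
TOKENS_PER_IMAGE = 256   # assumed 16 x 16 latent grid per image

class TinyLVM(nn.Module):
    """Toy decoder-only transformer over visual-token sequences."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8, max_len=4096):
        super().__init__()
        self.tok_embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):                         # tokens: (B, T) int64
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_embed(tokens) + self.pos_embed(pos)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (B, T, VOCAB_SIZE)

def next_token_loss(model, sentence):
    """Cross-entropy for next-token prediction: predict token i from tokens < i."""
    logits = model(sentence[:, :-1])                   # condition on the prefix
    targets = sentence[:, 1:]                          # shifted-by-one targets
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))

# A "visual sentence" is the concatenation of token grids from related
# images (a video clip, or image/annotation pairs); here it is random data.
model = TinyLVM()
sentence = torch.randint(0, VOCAB_SIZE, (2, 4 * TOKENS_PER_IMAGE))
next_token_loss(model, sentence).backward()
```

Under this framing, visual prompting at test time amounts to encoding a partial visual sentence (e.g. a few input/output example images followed by a query image) and sampling its continuation from the same model.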
Related papers
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over a Large Multi-modal Model's vocabulary (a toy sketch of this mapping appears after this list).
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while achieving a significant reduction in parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generation.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception
Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - LLM2Loss: Leveraging Language Models for Explainable Model Diagnostics [5.33024001730262]
We propose an approach that can provide semantic insights into a model's patterns of failures and biases.
We show that an ensemble of such lightweight models can be used to generate insights on the performance of the black-box model.
arXiv Detail & Related papers (2023-05-04T23:54:37Z) - Vision Models Are More Robust And Fair When Pretrained On Uncurated
Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model's performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)