Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
- URL: http://arxiv.org/abs/2410.22217v2
- Date: Wed, 30 Oct 2024 17:51:26 GMT
- Title: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
- Authors: Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma
- Abstract summary: We review the recent advances and discuss future directions for autoregressive vision foundation models.
We present the trend toward the next generation of vision foundation models, which unify understanding and generation in vision tasks.
We categorize autoregressive vision foundation models by their vision tokenizers and autoregression backbones.
- Abstract: Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there has been growing interest in extending this success to vision foundation models. In this survey, we review recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend toward the next generation of vision foundation models, i.e., unifying both understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models and present a formal definition of autoregression along with its advantages. Next, we categorize autoregressive vision foundation models by their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey to comprehensively summarize autoregressive vision foundation models under the trend of unifying understanding and generation. A collection of related resources is available at https://github.com/EmmaSRH/ARVFM.
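To make the paradigm concrete: "next token prediction" applied to vision means an image is first mapped to a sequence of discrete codes by a vision tokenizer, and an autoregressive backbone is trained to predict each code from the preceding ones, exactly as an LLM predicts words. The sketch below is a minimal, hypothetical illustration of that two-part design, not code from the survey or any surveyed model; the codebook size, sequence length, and the random codes standing in for real tokenizer output are arbitrary assumptions.

```python
# Minimal sketch (assumptions, not the survey's method): a VQ-style vision tokenizer
# is stubbed out with random integer codes, and a small decoder-only Transformer acts
# as the autoregressive backbone trained with shift-by-one next-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1024   # hypothetical codebook size of the vision tokenizer
SEQ_LEN = 64        # hypothetical number of codes per image (e.g., an 8x8 code grid)
D_MODEL = 128

class ToyARBackbone(nn.Module):
    """Decoder-only Transformer over discrete vision tokens."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):                       # tokens: (B, T) integer codes
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=causal)               # causal mask enforces autoregression
        return self.head(x)                           # (B, T, VOCAB_SIZE) next-token logits

# Training step: predict code t+1 from codes 0..t, exactly as in LLM pretraining.
model = ToyARBackbone()
codes = torch.randint(0, VOCAB_SIZE, (2, SEQ_LEN))   # stand-in for tokenizer output
logits = model(codes[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), codes[:, 1:].reshape(-1))
loss.backward()
```

In such a setup, generation samples codes one at a time from the same model and decodes them back to pixels with the tokenizer's decoder, which is why the survey organizes models around the choice of vision tokenizer and autoregression backbone.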
Related papers
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- Heterogeneous Contrastive Learning for Foundation Models and Beyond [73.74745053250619]
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive self-supervised learning to model large-scale heterogeneous data.
This survey critically evaluates the current landscape of heterogeneous contrastive learning for foundation models.
arXiv Detail & Related papers (2024-03-30T02:55:49Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey [30.528346074194925]
Visual foundation models (VFMs) have become a catalyst for groundbreaking developments in computer vision.
This review paper delineates the pivotal trajectories of VFMs, emphasizing their scalability and proficiency in generative tasks.
A crucial direction for forthcoming innovation is the amalgamation of generative and discriminative paradigms.
arXiv Detail & Related papers (2023-12-15T19:17:15Z)
- Foundation Models Meet Visualizations: Challenges and Opportunities [23.01218856618978]
This paper distinguishes visualizations for foundation models (VIS4FM) from foundation models for visualizations (FM4VIS).
In VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate models.
In FM4VIS, we highlight how foundation models can be utilized to advance the visualization field itself.
arXiv Detail & Related papers (2023-10-09T14:57:05Z)
- Graph Meets LLMs: Towards Large Graph Models [60.24970313736175]
We present a perspective paper to discuss the challenges and opportunities associated with developing large graph models.
First, we discuss the desired characteristics of large graph models.
Then, we present detailed discussions from three key perspectives: representation basis, graph data, and graph models.
arXiv Detail & Related papers (2023-08-28T12:17:51Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Foundation models in brief: A historical, socio-technical focus [2.5991265608180396]
Foundation models can be disruptive to future AI development by scaling up deep learning.
These models achieve state-of-the-art performance on a variety of tasks in domains such as natural language processing and computer vision.
arXiv Detail & Related papers (2022-12-17T22:11:33Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)