Towards the Unification of Generative and Discriminative Visual
Foundation Model: A Survey
- URL: http://arxiv.org/abs/2312.10163v1
- Date: Fri, 15 Dec 2023 19:17:15 GMT
- Title: Towards the Unification of Generative and Discriminative Visual
Foundation Model: A Survey
- Authors: Xu Liu, Tong Zhou, Yuanxin Wang, Yuping Wang, Qinjingwen Cao, Weizhi
Du, Yonghuan Yang, Junjun He, Yu Qiao, Yiqing Shen
- Abstract summary: Visual foundation models (VFMs) have become a catalyst for groundbreaking developments in computer vision.
This review paper delineates the pivotal trajectories of VFMs, emphasizing their scalability and proficiency in generative tasks.
A crucial direction for forthcoming innovation is the amalgamation of generative and discriminative paradigms.
- Score: 30.528346074194925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of foundation models, which are pre-trained on vast datasets, has
ushered in a new era of computer vision, characterized by their robustness and
remarkable zero-shot generalization capabilities. Mirroring the transformative
impact of foundation models like large language models (LLMs) in natural
language processing, visual foundation models (VFMs) have become a catalyst for
groundbreaking developments in computer vision. This review paper delineates
the pivotal trajectories of VFMs, emphasizing their scalability and proficiency
in generative tasks such as text-to-image synthesis, as well as their adeptness
in discriminative tasks including image segmentation. While generative and
discriminative models have historically charted distinct paths, we undertake a
comprehensive examination of the recent strides made by VFMs in both domains,
elucidating their origins, seminal breakthroughs, and pivotal methodologies.
Additionally, we collate and discuss the extensive resources that facilitate
the development of VFMs and address the challenges that pave the way for future
research endeavors. A crucial direction for forthcoming innovation is the
amalgamation of generative and discriminative paradigms. The nascent
application of generative models within discriminative contexts signifies the
early stages of this confluence. This survey aspires to be a contemporary
compendium for scholars and practitioners alike, charting the course of VFMs
and illuminating their multifaceted landscape.
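To make the generative-to-discriminative direction concrete, here is a minimal PyTorch sketch of the recurring pattern the survey describes: freeze a pre-trained generative backbone, treat its intermediate activations as features, and train only a small discriminative head on top. This is not code from the survey; the module names, shapes, and the toy backbone standing in for a diffusion model are all hypothetical.
```python
# Illustrative only: a toy stand-in for generative-to-discriminative transfer.
# A small frozen "denoiser" plays the role of a pre-trained generative model
# (e.g., a diffusion U-Net); its features feed a lightweight segmentation head.
import torch
import torch.nn as nn

class ToyDenoiserBackbone(nn.Module):
    """Stand-in for a frozen generative backbone (not a real diffusion model)."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )

    def forward(self, x, t):
        # Condition crudely on a (noise) timestep, as diffusion models do.
        h = self.stem(x) + t.view(-1, 1, 1, 1).float() / 1000.0
        return self.block(h)  # feature map reused for discrimination

backbone = ToyDenoiserBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # the generative priors stay frozen

seg_head = nn.Conv2d(64, 21, kernel_size=1)  # e.g., 21 semantic classes

images = torch.randn(2, 3, 64, 64)
t = torch.randint(0, 1000, (2,))  # a fixed "perception" timestep
with torch.no_grad():
    feats = backbone(images, t)
logits = seg_head(feats)  # (2, 21, 64, 64) per-pixel class scores
```
Only `seg_head` is trained; the sketch exists to show where the generative model's representation enters the discriminative pipeline.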
Related papers
- From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
Deploying Large Language Models (LLMs) to tackle CMR tasks has recently become the mainstream approach in the field.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z)
- Heterogeneous Contrastive Learning for Foundation Models and Beyond [73.74745053250619]
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive self-supervised learning to model large-scale heterogeneous data.
This survey critically evaluates the current landscape of heterogeneous contrastive learning for foundation models.
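As one concrete reference point (not drawn from the survey itself), the sketch below shows the InfoNCE objective at the core of most contrastive self-supervised learning; heterogeneous variants differ mainly in how the two views of a sample are produced, e.g., from different modalities. All shapes are illustrative.
```python
# Minimal InfoNCE loss: pull matched pairs together, push others apart.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (N, D) embeddings of two views of the same N samples."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```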
arXiv Detail & Related papers (2024-03-30T02:55:49Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture and training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- A Survey for Foundation Models in Autonomous Driving [10.315409708116865]
Large language models contribute to planning and simulation in autonomous driving.
Vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking.
Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning.
arXiv Detail & Related papers (2024-02-02T02:44:59Z)
- Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as the varying granularity of perception concealed in the latent variables at distinct time steps and in the various U-Net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
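A hypothetical sketch of the U-head idea described above: project hierarchical features hooked from several stages of a frozen generative backbone to a common width and resolution, fuse them, and classify. The feature shapes and the fusion scheme here are assumptions for illustration, not the paper's actual implementation.
```python
# Fuse multi-scale features from a frozen backbone into one prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedHead(nn.Module):
    def __init__(self, in_channels=(32, 64, 128), width=64, num_classes=21):
        super().__init__()
        # Project each stage to a common width before fusing.
        self.proj = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.classify = nn.Conv2d(width, num_classes, 1)

    def forward(self, feats):
        target = feats[0].shape[-2:]  # upsample all stages to the finest one
        fused = sum(
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        return self.classify(fused)

# Stand-ins for features hooked from three U-Net stages at one timestep.
feats = [torch.randn(1, 32, 64, 64),
         torch.randn(1, 64, 32, 32),
         torch.randn(1, 128, 16, 16)]
logits = UnifiedHead()(feats)  # (1, 21, 64, 64)
```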
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
- Towards Graph Foundation Models: A Survey and Beyond [66.37994863159861]
Foundation models have emerged as critical components in a variety of artificial intelligence applications.
The capabilities of foundation models to generalize and adapt motivate graph machine learning researchers to discuss the potential of developing a new graph learning paradigm.
This article introduces the concept of Graph Foundation Models (GFMs), and offers an exhaustive explanation of their key characteristics and underlying technologies.
arXiv Detail & Related papers (2023-10-18T09:31:21Z)
- Foundation Models Meet Visualizations: Challenges and Opportunities [23.01218856618978]
This paper divides the intersection of the two fields into visualizations for foundation models (VIS4FM) and foundation models for visualizations (FM4VIS).
In VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate models.
In FM4VIS, we highlight how foundation models can be utilized to advance the visualization field itself.
arXiv Detail & Related papers (2023-10-09T14:57:05Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between modalities such as vision and language, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
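The box-prompted segmentation example matches promptable models such as Meta AI's Segment Anything (SAM); the summary above does not name a specific model, so the sketch below uses the published segment_anything package as one concrete instance. The checkpoint path, the image, and the box coordinates are placeholders.
```python
# Prompting a pre-trained segmentation model with a bounding box: no retraining.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB image
predictor.set_image(image)

box = np.array([100, 100, 300, 300])  # (x0, y0, x1, y1) prompt
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
# masks[0] is a boolean mask for the object inside the box.
```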
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
A lack of interpretability, robustness, and out-of-distribution generalization is becoming a central challenge for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)