Vision Generalist Model: A Survey
- URL: http://arxiv.org/abs/2506.09954v1
- Date: Wed, 11 Jun 2025 17:23:41 GMT
- Title: Vision Generalist Model: A Survey
- Authors: Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei, Xumin Yu, Zuyan Liu, Yanbo Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
- Abstract summary: We provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
- Score: 87.49797517847132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, we have witnessed the great success of generalist models in natural language processing. A generalist model is a general framework trained on massive data that can process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to unify them under a single representation. In this paper, we provide a comprehensive overview of vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we examine the framework designs proposed in existing research and introduce the techniques employed to enhance their performance. To help researchers better comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we present some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research.
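To make the "unified representation" difficulty concrete, below is a minimal, hypothetical sketch of one common design in generalist models: heterogeneous task outputs (here, a detection box and its label) are serialized into a shared discrete token vocabulary so a single sequence model can emit them, in the spirit of Pix2Seq-style interfaces. The bin count, token layout, and function names are assumptions for illustration, not the scheme of any specific surveyed model.

```python
# Hypothetical sketch: serialize a detection output into a unified token
# sequence so one sequence model can produce outputs for many vision tasks.
NUM_BINS = 1000  # coordinate quantization bins (an assumed design choice)

def quantize(value: float, lo: float, hi: float) -> int:
    """Map a continuous coordinate into one of NUM_BINS discrete token ids."""
    return round((value - lo) / (hi - lo) * (NUM_BINS - 1))

def detection_to_tokens(box, class_id, img_w, img_h):
    """Encode one bounding box plus its class label as five tokens."""
    x0, y0, x1, y1 = box
    return [
        quantize(x0, 0, img_w), quantize(y0, 0, img_h),
        quantize(x1, 0, img_w), quantize(y1, 0, img_h),
        NUM_BINS + class_id,  # class tokens are placed after the coordinate bins
    ]

# Example: a 640x480 image with one box of class 7.
print(detection_to_tokens((32, 48, 256, 320), class_id=7, img_w=640, img_h=480))
# -> [50, 100, 400, 666, 1007]
```

Other dense outputs (masks, depth maps, captions) can be serialized into the same vocabulary, which is what lets one decoder cover many tasks.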
Related papers
- AGI-Elo: How Far Are We From Mastering A Task? [8.378767006620294]
This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains.
arXiv Detail & Related papers (2025-05-19T08:30:13Z)
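As a rough intuition for jointly rating models and test cases, here is a classic Elo-style update sketch; this is a hypothetical illustration, not AGI-Elo's actual formulation, and all names below are invented for the example.

```python
def elo_update(model_rating: float, case_rating: float,
               solved: bool, k: float = 32.0) -> tuple[float, float]:
    """One classic Elo update treating a test case as the model's 'opponent'.

    Hypothetical sketch only: the paper's AGI-Elo formulation differs; this
    merely illustrates jointly rating model competency and case difficulty.
    """
    # Expected probability that the model solves the case, given the ratings.
    expected = 1.0 / (1.0 + 10 ** ((case_rating - model_rating) / 400.0))
    outcome = 1.0 if solved else 0.0
    delta = k * (outcome - expected)
    # The model gains rating by solving hard cases; a case gains difficulty
    # rating whenever models fail on it.
    return model_rating + delta, case_rating - delta

print(elo_update(1500.0, 1600.0, solved=True))  # -> (~1520.5, ~1579.5)
```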
- Explainability for Vision Foundation Models: A Survey [3.570403495760109]
Foundation models occupy an ambiguous position in the explainability domain: they are characterized by extensive generalization capabilities and emergent uses. We discuss the challenges faced by current research in integrating XAI within foundation models.
arXiv Detail & Related papers (2025-01-21T15:18:55Z)
- How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey [59.23394353614928]
In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges.
arXiv Detail & Related papers (2024-12-11T07:29:04Z)
- Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA). We thoroughly investigate the benefits and trade-offs of object-centric (OC) models and alternative approaches. We identify a promising path to leverage the strengths of both paradigms.
arXiv Detail & Related papers (2024-07-22T12:26:08Z)
- Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis [16.86139440201837]
We focus on the topic of form understanding in the context of scanned documents.
Our research methodology involves an in-depth analysis of popular document and form understanding trends over the last decade.
We showcase how transformers have propelled the field forward, revolutionizing form-understanding techniques.
arXiv Detail & Related papers (2024-03-06T22:22:02Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
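The following is a minimal, hypothetical sketch of what autoregression-based vision modeling generally means: images or task demonstrations are turned into discrete token sequences, and a causal Transformer is trained to predict each token from its prefix. The vocabulary size, architecture, and tokenization are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: next-token training over visual token sequences.
vocab_size, d_model, seq_len = 8192, 256, 64

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(d_model, vocab_size)

# Two dummy "visual sentences" (e.g., VQ-tokenized images or task pairs).
tokens = torch.randint(0, vocab_size, (2, seq_len))
# Causal mask: position t may only attend to positions <= t.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = backbone(embed(tokens), mask=causal_mask)
logits = head(hidden)  # per-position next-token logits

# Standard shifted cross-entropy: position t predicts token t + 1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```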
- Review of Large Vision Models and Visual Prompt Engineering [50.63394642549947]
This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering.
We present influential large models in the visual domain and a range of prompt engineering methods employed on these models.
arXiv Detail & Related papers (2023-07-03T08:48:49Z)
- A Comprehensive Survey on Segment Anything Model for Vision and Beyond [7.920790211915402]
There is an urgent need to design a general class of models, termed foundation models, trained on broad data.
The recently proposed segment anything model (SAM) has made significant progress in breaking the boundaries of segmentation.
This paper introduces the background and terminology for foundation models, including SAM, as well as state-of-the-art methods contemporaneous with SAM.
arXiv Detail & Related papers (2023-05-14T16:23:22Z)
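For context on how SAM is typically prompted in practice, here is a short usage sketch based on the public facebookresearch/segment-anything package; the checkpoint filename, image path, and point prompt are assumptions for illustration.

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM backbone (checkpoint path is an assumed local file).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SAM expects an HxWx3 RGB uint8 image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point (label 1); coordinates are illustrative.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return candidate masks at different scales
)
print(masks.shape, scores)  # boolean masks of shape (3, H, W) with scores
```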
- Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-to-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization has become a key challenge for existing visual models.
Inspired by the strong inference ability of human-level agents, researchers have devoted great effort in recent years to developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and highlight the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)