A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
- URL: http://arxiv.org/abs/2312.12436v2
- Date: Wed, 20 Dec 2023 12:40:47 GMT
- Title: A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
- Authors: Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang,
Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui
Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing
Sun
- Abstract summary: Gemini is Google's newest and most capable MLLM built from the ground up for multi-modality.
Can Gemini challenge GPT-4V's leading position in multi-modal learning?
We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx.
- Score: 78.54563675327198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of its
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to provide
detailed explanations and intermediate steps, while Gemini prefers to output
direct and concise answers. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini also reveals some issues common to
MLLMs, indicating that there remains a considerable distance towards artificial
general intelligence. Our project for tracking the progress of MLLMs is released
at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Related papers
- Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models [22.545127591893028]
Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA).
This is mainly due to their limited ability to effectively integrate complex visual cues with textual information, as well as potential object hallucinations.
We present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA.
arXiv Detail & Related papers (2024-04-06T05:59:02Z)
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities [111.44485171421535]
We study the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities.
We regard these properties as representative factors that define the reliability of MLLMs.
We uncover 14 empirical findings that are useful to understand the capabilities and limitations of both proprietary and open-source MLLMs.
arXiv Detail & Related papers (2024-01-26T18:53:03Z)
- Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models [14.30980373935713]
Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration.
Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks.
This study undertakes a thorough evaluation of Gemini's performance in complex reasoning tasks.
arXiv Detail & Related papers (2023-12-29T15:57:49Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Gemini: A Family of Highly Capable Multimodal Models [629.0779987066369]
A new family of multimodal models, Gemini, exhibits remarkable capabilities across image, audio, video, and text understanding.
The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases.
arXiv Detail & Related papers (2023-12-19T02:39:27Z)
- An In-depth Look at Gemini's Language Abilities [49.897870833250494]
We compare the abilities of the OpenAI GPT and Google Gemini models.
We perform this analysis over 10 datasets testing a variety of language abilities.
We find that Gemini Pro achieves accuracy that is close to, but slightly below, that of the corresponding GPT 3.5 Turbo.
arXiv Detail & Related papers (2023-12-18T18:47:42Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)