An In-depth Look at Gemini's Language Abilities
- URL: http://arxiv.org/abs/2312.11444v2
- Date: Sun, 24 Dec 2023 12:25:10 GMT
- Title: An In-depth Look at Gemini's Language Abilities
- Authors: Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig
- Abstract summary: We compare the abilities of the OpenAI GPT and Google Gemini models.
We perform this analysis over 10 datasets testing a variety of language abilities.
We find that Gemini Pro achieves accuracy that is close to, but slightly below, that of the corresponding GPT 3.5 Turbo.
- Score: 49.897870833250494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently released Google Gemini class of models is the first to
comprehensively report results that rival the OpenAI GPT series across a wide
variety of tasks. In this paper, we do an in-depth exploration of Gemini's
language abilities, making two contributions. First, we provide a third-party,
objective comparison of the abilities of the OpenAI GPT and Google Gemini
models with reproducible code and fully transparent results. Second, we take a
closer look at the results, identifying areas where one of the two model
classes excels. We perform this analysis over 10 datasets testing a variety of
language abilities, including reasoning, answering knowledge-based questions,
solving math problems, translating between languages, generating code, and
acting as instruction-following agents. From this analysis, we find that Gemini
Pro achieves accuracy that is close to, but slightly below, that of the
corresponding GPT 3.5 Turbo on all tasks that we benchmarked. We further provide explanations
for some of this under-performance, including failures in mathematical
reasoning with many digits, sensitivity to multiple-choice answer ordering,
aggressive content filtering, and others. We also identify areas where Gemini
demonstrates comparably high performance, including generation into non-English
languages, and handling longer and more complex reasoning chains. Code and data
for reproduction can be found at https://github.com/neulab/gemini-benchmark.
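
One finding worth illustrating is the sensitivity to multiple-choice answer ordering: scoring the same questions with and without shuffling the option order can expose it. The sketch below is a minimal illustration of that probe, not the paper's actual harness (that lives in the linked repository); `query_model`, `accuracy`, and the toy dataset are hypothetical placeholders.

```python
import random

def query_model(question: str, options: list[str]) -> int:
    # Hypothetical stand-in for a call to Gemini Pro or GPT 3.5 Turbo.
    # It always picks the first option so the sketch runs end to end;
    # swap in a real API client to reproduce the paper's setup.
    return 0

def accuracy(dataset, shuffle_options: bool = False, seed: int = 0) -> float:
    # Score a multiple-choice dataset, optionally shuffling the option
    # order before querying. A gap between the shuffled and unshuffled
    # scores is the ordering sensitivity described in the abstract.
    rng = random.Random(seed)
    correct = 0
    for question, options, answer_idx in dataset:
        order = list(range(len(options)))
        if shuffle_options:
            rng.shuffle(order)
        shown = [options[i] for i in order]
        pred = query_model(question, shown)
        if order[pred] == answer_idx:  # map back to the original index
            correct += 1
    return correct / len(dataset)

# Toy dataset: (question, options, index of the correct option).
dataset = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Lima"], 0),
]
print(accuracy(dataset))                        # fixed option order
print(accuracy(dataset, shuffle_options=True))  # shuffled option order
```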
Related papers
- Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language
Models [14.30980373935713]
Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration.
Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks.
This study undertakes a thorough evaluation of Gemini's performance in complex reasoning tasks.
arXiv Detail & Related papers (2023-12-29T15:57:49Z)
- A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise [78.54563675327198]
Gemini is Google's newest and most capable MLLM built from the ground up for multi-modality.
Can Gemini challenge GPT-4V's leading position in multi-modal learning?
We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx.
arXiv Detail & Related papers (2023-12-19T18:59:22Z)
- Gemini: A Family of Highly Capable Multimodal Models [629.0779987066369]
Gemini, a new family of multimodal models, exhibits remarkable capabilities across image, audio, video, and text understanding.
The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases.
arXiv Detail & Related papers (2023-12-19T02:39:27Z)
- Deep Bidirectional Language-Knowledge Graph Pretraining [159.9645181522436]
DRAGON is a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale.
Our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities.
arXiv Detail & Related papers (2022-10-17T18:02:52Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)