Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
- URL: http://arxiv.org/abs/2404.12387v1
- Date: Thu, 18 Apr 2024 17:59:48 GMT
- Title: Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
- Authors: Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
- Abstract summary: Reka models are able to process and reason with text, images, video, and audio inputs.
We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models.
Our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations.
- Score: 69.35758259895259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized value for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively with GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at http://chat.reka.ai. A showcase of non-cherry-picked qualitative examples can also be found at http://showcase.reka.ai.
Related papers
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models [14.878276985702685]
In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models.
We define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue.
We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them.
arXiv Detail & Related papers (2024-06-20T06:56:19Z)
- Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models [67.62126108440003]
We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models.
Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts.
We discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment.
arXiv Detail & Related papers (2024-05-03T17:59:55Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English.
We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)