Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model
- URL: http://arxiv.org/abs/2411.05903v1
- Date: Fri, 08 Nov 2024 17:15:17 GMT
- Title: Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model
- Authors: Ben Koska, Mojmír Horváth,
- Abstract summary: We present a novel 4.5B parameter small language model that can handle multiple input and output modalities.
Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks.
- Score: 0.0
- License:
- Abstract: We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model's strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
Related papers
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - 4M: Massively Multimodal Masked Modeling [20.69496647914175]
Current machine learning models for vision are often highly specialized and limited to a single modality and task.
Recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision.
We propose a multimodal training scheme called 4M for training versatile and scalable foundation models for vision tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - A Visual Tour Of Current Challenges In Multimodal Language Models [24.083086685623247]
Multimodal learning can overcome the challenges of learning effective word representations for function words.
We find that stable diffusion models effectively model only a small fraction of function words.
arXiv Detail & Related papers (2022-10-22T22:53:55Z) - MultiViz: An Analysis Benchmark for Visualizing and Understanding
Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.