Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
- URL: http://arxiv.org/abs/2404.11459v2
- Date: Thu, 18 Apr 2024 07:32:52 GMT
- Title: Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
- Authors: Wei Chen, Zhiyuan Li
- Abstract summary: A multimodal AI agent is characterized by its ability to process and learn from various types of data.
We introduce a multimodal model that incorporates the concept of functional tokens, specifically designed for AI agent applications.
We demonstrate that this model is capable of operating efficiently on a wide range of edge devices, including ones as constrained as a Raspberry Pi.
- Score: 10.998608318944985
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A multimodal AI agent is characterized by its ability to process and learn from various types of data, including natural language, visual, and audio inputs, to inform its actions. Despite advancements in large language models that incorporate visual data, such as GPT-4V, effectively translating image-based data into actionable outcomes for AI agents continues to be challenging. In this paper, we introduce a multimodal model that incorporates the concept of functional tokens, specifically designed for AI agent applications. To ensure compatibility with edge devices, our model is optimized to a compact size of less than 1B parameters. Like GPT-4, our model can process both English and Chinese. We demonstrate that this model is capable of operating efficiently on a wide range of edge devices, including ones as constrained as a Raspberry Pi.
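The core idea of functional tokens, as summarized above, is to reserve dedicated tokens in the model's vocabulary so that the agent selects an action by emitting a single token plus its arguments rather than free-form text. The sketch below shows how such an output could be routed to a concrete call on a device; the token names (<nexa_0>, ...) and the action registry are hypothetical placeholders, not the actual Octopus v3 vocabulary or API.

```python
# Minimal sketch of functional-token dispatch, under the assumption that the
# model emits a reserved token followed by its arguments, e.g.
# '<nexa_0>("weather in Boston")'. Token names and actions are hypothetical.
import re
from typing import Callable, Dict

# Hypothetical registry mapping reserved functional tokens to device actions.
FUNCTION_REGISTRY: Dict[str, Callable[[str], str]] = {
    "<nexa_0>": lambda query: f"web_search(query={query!r})",
    "<nexa_1>": lambda path: f"open_camera(save_to={path!r})",
}

def dispatch(model_output: str) -> str:
    """Route a generation of the form '<token>(args)' to its registered action;
    anything else is returned unchanged as ordinary text."""
    match = re.match(r"(<nexa_\d+>)\((.*)\)", model_output.strip())
    if match is None:
        return model_output
    token, raw_args = match.groups()
    action = FUNCTION_REGISTRY.get(token)
    if action is None:
        raise ValueError(f"unknown functional token: {token}")
    return action(raw_args.strip('"'))

if __name__ == "__main__":
    print(dispatch('<nexa_0>("weather in Boston")'))  # -> web_search(query='weather in Boston')
```

Mapping each action to one reserved vocabulary token keeps the device-side dispatch logic trivial, which fits the edge-deployment goal described in the abstract.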
Related papers
- The Llama 3 Herd of Models [345.58284886597346]
This paper presents a new set of foundation models, called Llama 3.
Llama 3 is a herd of language models that support multilinguality, coding, reasoning, and tool usage.
We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks.
arXiv Detail & Related papers (2024-07-31T17:54:27Z)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction (a minimal sketch of such a combined objective follows this entry).
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
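The entry above unifies three pre-training strategies into one paradigm. Below is a minimal sketch of how such a combined objective could be written, assuming hypothetical tensor shapes and equal loss weights; the actual architecture and weighting of the Interactive Agent Foundation Model are not specified here.

```python
# Minimal sketch of a combined multi-task pre-training objective (masked image
# reconstruction + language modeling + next-action prediction). Shapes, inputs,
# and loss weights are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def unified_loss(
    mae_pred: torch.Tensor, mae_target: torch.Tensor,          # reconstructed vs. original masked patches
    lm_logits: torch.Tensor, lm_labels: torch.Tensor,          # (batch, seq, vocab) vs. (batch, seq)
    action_logits: torch.Tensor, action_labels: torch.Tensor,  # (batch, n_actions) vs. (batch,)
    w_mae: float = 1.0, w_lm: float = 1.0, w_action: float = 1.0,
) -> torch.Tensor:
    """Sum the three objectives named in the abstract summary."""
    loss_mae = F.mse_loss(mae_pred, mae_target)
    loss_lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten())
    loss_action = F.cross_entropy(action_logits, action_labels)
    return w_mae * loss_mae + w_lm * loss_lm + w_action * loss_action
```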
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones [18.954681684239358]
This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks.
With a 2.8-billion-parameter language model, TinyGPT-V achieves results in VQA and image inference tasks comparable to those of its larger counterparts.
arXiv Detail & Related papers (2023-12-28T07:11:41Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a sketch of this recipe follows this entry).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
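The eP-ALM summary above reduces adaptation to three ingredients: a frozen language model, a single trainable linear projection from visual features into the LM embedding space, and one trainable prepended token. The sketch below illustrates that recipe; the module names, dimensions, and the HuggingFace-style `inputs_embeds` interface are assumptions, not eP-ALM's actual code.

```python
# Minimal sketch of the parameter-efficient recipe described above: freeze the
# language model, train only one linear projection and one soft token.
# Names and dimensions are placeholders; the LM is assumed to accept
# precomputed input embeddings (HuggingFace-style `inputs_embeds`).
import torch
import torch.nn as nn

class PerceptualAdapter(nn.Module):
    def __init__(self, lm: nn.Module, vision_dim: int, lm_dim: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():            # freeze >99% of the parameters
            p.requires_grad = False
        self.proj = nn.Linear(vision_dim, lm_dim)                  # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor):
        # vision_feats: (batch, n_patches, vision_dim); text_embeds: (batch, seq, lm_dim)
        visual = self.proj(vision_feats)                           # map perception into LM space
        prefix = self.soft_token.expand(text_embeds.size(0), -1, -1)
        inputs = torch.cat([prefix, visual, text_embeds], dim=1)   # prepend token + visual prefix
        return self.lm(inputs_embeds=inputs)
```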
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.