OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
- URL: http://arxiv.org/abs/2408.03047v1
- Date: Tue, 6 Aug 2024 09:02:53 GMT
- Title: OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
- Authors: Qiang Sun, Yuanyi Luo, Sirui Li, Wenxiao Zhang, Wei Liu
- Abstract summary: OpenOmni is an open-source, end-to-end pipeline benchmarking tool.
It integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, and Large Language Models.
It supports local and cloud deployment, ensuring data privacy and enabling latency and accuracy benchmarking.
- Score: 11.928422245125985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal conversational agents are highly desirable because they offer natural, human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250 ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focus on real bottlenecks, and facilitate rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, the demo is available at https://openomni.ai4wa.com, and the code is available at https://github.com/AI4WA/OpenOmniFramework.
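The abstract describes a staged pipeline (Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, then an LLM) with per-stage latency benchmarking. The following is a minimal sketch of that idea only; the function and class names (`run_pipeline`, `StageResult`, the placeholder stage functions) are hypothetical illustrations, not OpenOmni's actual API.

```python
# Minimal sketch of a staged conversational pipeline with per-stage latency
# benchmarking. Stage implementations are hypothetical placeholders, not
# OpenOmni's real components.
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StageResult:
    name: str
    output: Any
    latency_ms: float


def run_pipeline(audio: bytes,
                 stages: list[tuple[str, Callable[[Any], Any]]]) -> list[StageResult]:
    """Run each stage on the previous stage's output and record wall-clock latency."""
    results: list[StageResult] = []
    data: Any = audio
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append(StageResult(name, data, elapsed_ms))
    return results


# Placeholder stage functions (assumptions for illustration only).
def speech_to_text(audio: bytes) -> str:
    return "turn on the kitchen light"           # would call an STT model or service


def detect_emotion(text: str) -> dict:
    return {"text": text, "emotion": "neutral"}  # would call an emotion classifier


def retrieve_context(payload: dict) -> dict:
    payload["context"] = ["kitchen light is off"]  # would query a vector store (RAG)
    return payload


def generate_response(payload: dict) -> str:
    return "Sure, turning on the kitchen light."  # would call a local or cloud LLM


if __name__ == "__main__":
    stages = [
        ("speech_to_text", speech_to_text),
        ("emotion_detection", detect_emotion),
        ("rag_retrieval", retrieve_context),
        ("llm_response", generate_response),
    ]
    for r in run_pipeline(b"<raw audio bytes>", stages):
        print(f"{r.name}: {r.latency_ms:.2f} ms")
```

In a real deployment, each placeholder would be swapped for a local or cloud model endpoint, which is where the latency, accuracy, cost, and privacy trade-offs the paper benchmarks would actually show up.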
Related papers
- CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR [31.49021749468963]
Large language model (LLM)-powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR.
We provide the community with an open-source, customizable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models.
arXiv Detail & Related papers (2024-11-07T12:55:17Z) - Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [0.0]
Mini-Omni2 is a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries.
We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset.
arXiv Detail & Related papers (2024-10-15T02:10:45Z) - OpenHands: An Open Platform for AI Software Developers as Generalist Agents [109.8507367518992]
We introduce OpenHands, a platform for the development of AI agents that interact with the world in similar ways to a human developer.
We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, and incorporation of evaluation benchmarks.
arXiv Detail & Related papers (2024-07-23T17:50:43Z) - MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation [36.50320728984937]
We introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments.
MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping.
arXiv Detail & Related papers (2024-07-02T06:08:55Z) - LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z) - GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation [21.91914619107555]
GLM-Dialog is a large-scale language model (LLM) with 10B parameters capable of knowledge-grounded conversation in Chinese.
We offer our evaluation platform online in an effort to promote the development of open-source models and reliable dialogue evaluation systems.
arXiv Detail & Related papers (2023-02-28T08:35:28Z) - ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents [27.222054181839095]
ADVISER is an open-source, multi-domain dialog system toolkit.
It enables the development of multi-modal (incorporating speech, text and vision) conversational agents.
The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend.
arXiv Detail & Related papers (2020-05-04T18:27:58Z)