i-Code Studio: A Configurable and Composable Framework for Integrative
AI
- URL: http://arxiv.org/abs/2305.13738v1
- Date: Tue, 23 May 2023 06:45:55 GMT
- Title: i-Code Studio: A Configurable and Composable Framework for Integrative
AI
- Authors: Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, Ziyi Yang, Reid Pryzant,
Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng and Xuedong
Huang
- Abstract summary: We propose the i-Code Studio, a flexible and composable framework for Integrative AI.
The i-Code Studio orchestrates multiple pre-trained models in a fine-tuning-free fashion to conduct complex multimodal tasks.
The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering.
- Score: 93.74891865028867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial General Intelligence (AGI) requires comprehensive understanding
and generation capabilities for a variety of tasks spanning different
modalities and functionalities. Integrative AI is one important direction to
approach AGI, through combining multiple models to tackle complex multimodal
tasks. However, there is a lack of a flexible and composable platform to
facilitate efficient and effective model composition and coordination. In this
paper, we propose the i-Code Studio, a configurable and composable framework
for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models
in a finetuning-free fashion to conduct complex multimodal tasks. Instead of
simple model composition, the i-Code Studio provides an integrative, flexible,
and composable setting for developers to quickly and easily compose
cutting-edge services and technologies tailored to their specific requirements.
The i-Code Studio achieves impressive results on a variety of zero-shot
multimodal tasks, such as video-to-text retrieval, speech-to-speech
translation, and visual question answering. We also demonstrate how to quickly
build a multimodal agent based on the i-Code Studio that can communicate with
users and personalize its responses.
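The orchestration described above is, at its core, a composition of independently pre-trained services behind a common interface, with no fine-tuning involved. The following is a minimal, hypothetical Python sketch of that pattern for speech-to-speech translation; the Pipeline class and the stage functions are illustrative placeholders, not the actual i-Code Studio API or the services it builds on.

```python
# Hypothetical sketch of fine-tuning-free composition: pre-trained services are
# wrapped behind a common callable interface and chained into a multimodal task.
# All names here are illustrative placeholders, not the i-Code Studio API.

from typing import Callable, Dict, List


class Pipeline:
    """Chains independently pre-trained services into one multimodal task."""

    def __init__(self) -> None:
        self.stages: List[Callable[[Dict], Dict]] = []

    def add(self, stage: Callable[[Dict], Dict]) -> "Pipeline":
        self.stages.append(stage)
        return self  # fluent composition: pipelines are assembled, not trained

    def run(self, payload: Dict) -> Dict:
        for stage in self.stages:
            payload = stage(payload)
        return payload


# Placeholder stages standing in for off-the-shelf pre-trained models.
def speech_to_text(payload: Dict) -> Dict:
    payload["text"] = f"<transcript of {payload['audio']}>"   # e.g. an ASR service
    return payload


def translate(payload: Dict) -> Dict:
    payload["text"] = f"<translation of {payload['text']}>"   # e.g. an MT model
    return payload


def text_to_speech(payload: Dict) -> Dict:
    payload["audio_out"] = f"<speech for {payload['text']}>"  # e.g. a TTS model
    return payload


# Speech-to-speech translation assembled purely by composition.
s2st = Pipeline().add(speech_to_text).add(translate).add(text_to_speech)
print(s2st.run({"audio": "input.wav"}))
```

Each stage here only returns a placeholder string; in a real system it would call a hosted pre-trained model, and the same composition pattern would let developers swap individual services without retraining anything.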
Related papers
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks covering tasks such as ASR, ST, SV, and ER, and also apply it to specialized sets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- SAI: Solving AI Tasks with Systematic Artificial Intelligence in Communication Network [4.302209772725456]
Systematic Artificial Intelligence (SAI) is a framework designed to solve AI tasks by leveraging Large Language Models (LLMs) and intent-format-based input.
SAI can complete numerous complex AI tasks in the communication network, achieving impressive results in network optimization, resource allocation, and other challenging tasks.
arXiv Detail & Related papers (2023-10-13T12:14:58Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
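The i-Code entry above lists cross-modality contrastive learning among its pretraining objectives. The snippet below is a minimal, self-contained sketch of a symmetric InfoNCE-style contrastive loss between two modalities, included only to illustrate the general idea; the actual i-Code objective, encoders, and hyperparameters are not reproduced here.

```python
# Minimal sketch of a symmetric cross-modality contrastive (InfoNCE-style) loss.
# The random tensors stand in for encoder outputs; real systems would use
# modality-specific pre-trained encoders and tuned temperatures.

import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(video_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pulls paired (video, text) embeddings together, pushes mismatched pairs apart."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # pairwise cosine similarities
    targets = torch.arange(len(v))          # i-th video matches i-th text
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


# Toy usage: random "encoder outputs" for a batch of 8 paired clips and captions.
video = torch.randn(8, 256)
text = torch.randn(8, 256)
print(cross_modal_contrastive_loss(video, text).item())
```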