Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
- URL: http://arxiv.org/abs/2310.18369v1
- Date: Wed, 25 Oct 2023 22:36:40 GMT
- Title: Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
- Authors: Daniela Ben-David, Tzuf Paz-Argaman, Reut Tsarfaty
- Abstract summary: We propose a modular framework that leverages the expertise of different foundation models over different modalities and domains.
Our approach enables decentralized command execution and allows each model to both contribute and benefit from the expertise of the other models.
We demonstrate this method on a novel task, audio-aware image captioning, in which an image and audio are given and the task is to generate text that describes the image within the context of the provided audio.
- Score: 14.359111652624899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a modular framework that leverages the expertise of different
foundation models over different modalities and domains in order to perform a
single, complex, multi-modal task, without relying on prompt engineering or
otherwise tailor-made multi-modal training. Our approach enables decentralized
command execution and allows each model to both contribute and benefit from the
expertise of the other models. Our method can be extended to a variety of
foundation models (including audio and vision), above and beyond only language
models, as it does not depend on prompts. We demonstrate our approach on two
tasks. On the well-known task of stylized image captioning, our experiments
show that our approach outperforms semi-supervised state-of-the-art models,
while being zero-shot and avoiding costly training, data collection, and prompt
engineering. We further demonstrate this method on a novel task, audio-aware
image captioning, in which an image and audio are given and the task is to
generate text that describes the image within the context of the provided
audio. Our code is available on GitHub.
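The abstract describes the framework only at a high level. As a rough illustration of the general expert-combination idea (and not the paper's actual pipeline), the sketch below shows how candidate captions proposed by a language expert could be reranked by independent image and audio experts. All names here (combine_experts, propose_captions, score_image, score_audio) and the toy scorers are hypothetical stand-ins; in a real system the scorers would be pretrained image-text and audio-text models.

```python
from typing import Callable, List, Sequence

def combine_experts(
    candidates: Sequence[str],
    experts: Sequence[Callable[[str], float]],
    weights: Sequence[float],
) -> str:
    """Return the candidate whose weighted sum of expert scores is highest."""
    def total(caption: str) -> float:
        return sum(w * expert(caption) for expert, w in zip(experts, weights))
    return max(candidates, key=total)

def propose_captions() -> List[str]:
    # Stand-in for a language expert that proposes candidate captions.
    return [
        "a dog runs on the beach",
        "a dog barks at crashing waves",
        "a cat sleeps on a sofa",
    ]

def score_image(caption: str) -> float:
    # Stand-in for an image-text expert (e.g. CLIP-style similarity):
    # here it simply rewards captions that mention the pictured object.
    return 1.0 if "dog" in caption else 0.0

def score_audio(caption: str) -> float:
    # Stand-in for an audio-text expert (e.g. CLAP-style similarity):
    # here it rewards captions consistent with the recorded sound.
    return 1.0 if "waves" in caption or "barks" in caption else 0.0

if __name__ == "__main__":
    best = combine_experts(
        candidates=propose_captions(),
        experts=[score_image, score_audio],
        weights=[1.0, 0.5],  # relative trust in each expert; a free design choice
    )
    print(best)  # -> "a dog barks at crashing waves"
```

The point of the sketch is only the combination rule: each expert judges candidates against its own modality, so no joint multimodal training or prompt engineering is assumed.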
Related papers
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks that is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.