Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- URL: http://arxiv.org/abs/2204.00598v1
- Date: Fri, 1 Apr 2022 17:43:13 GMT
- Title: Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- Authors: Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico
Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent
Vanhoucke, Pete Florence
- Abstract summary: Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on.
We show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue.
- Score: 49.82293730925404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large foundation models can exhibit unique capabilities depending on the
domain of data they are trained on. While these domains are generic, they may
only barely overlap. For example, visual-language models (VLMs) are trained on
Internet-scale image captions, but large language models (LMs) are further
trained on Internet-scale text with no images (e.g. from spreadsheets, to SAT
questions). As a result, these models store different forms of commonsense
knowledge across different domains. In this work, we show that this model
diversity is symbiotic, and can be leveraged to build AI systems with
structured Socratic dialogue -- in which new multimodal tasks are formulated as
a guided language-based exchange between different pre-existing foundation
models, without additional finetuning. In the context of egocentric perception,
we present a case study of Socratic Models (SMs) that can provide meaningful
results for complex tasks such as generating free-form answers to contextual
questions about egocentric video, by formulating video Q&A as short story Q&A,
i.e. summarizing the video into a short story, then answering questions about
it. Additionally, SMs can generate captions for Internet images, and are
competitive with state-of-the-art on zero-shot video-to-text retrieval with
42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models
zero-shot to capture new multimodal functionalities, without domain-specific
data collection. Prototypes are available at socraticmodels.github.io.
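As a concrete illustration of the recipe the abstract describes (caption video frames with a VLM, have an LM summarize the captions into a short story, then answer questions against that story), here is a minimal Python sketch. The hooks `caption_image` and `complete_text` stand in for whatever VLM and LM are available; they are placeholders, not the released prototypes.

```python
# Minimal sketch of the video-Q&A-as-short-story-Q&A recipe described above:
# 1) caption sampled frames with a visual-language model,
# 2) ask a language model to summarize the captions into a short story,
# 3) answer free-form questions against that story instead of the raw video.
# `caption_image` and `complete_text` are placeholder hooks, not the released
# prototypes; plug in whichever VLM / LM you have access to.

from typing import Callable, Sequence


def video_to_story(frames: Sequence,
                   caption_image: Callable[[object], str],
                   complete_text: Callable[[str], str]) -> str:
    """Summarize an egocentric video into a short story via frame captions."""
    captions = [caption_image(frame) for frame in frames]
    log = "\n".join(f"{i}. {c}" for i, c in enumerate(captions, 1))
    prompt = ("These are captions of frames from an egocentric video, in order:\n"
              f"{log}\n\n"
              "Summarize them into a short first-person story of what happened:")
    return complete_text(prompt)


def answer_question(story: str, question: str,
                    complete_text: Callable[[str], str]) -> str:
    """Answer a contextual question by reading the story, not the video."""
    return complete_text(f"Story: {story}\n\nQ: {question}\nA:")
```

Because every hand-off between models is plain language, either component can be replaced without finetuning, which is the zero-shot composition property the abstract emphasizes.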
Related papers
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
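The UnIVAL entry above mentions multimodal model merging; the simplest form of weight-space merging is linear interpolation of checkpoints fine-tuned from a shared initialization. The sketch below shows only that generic operation (uniform interpolation of two PyTorch state dicts), not the paper's exact merging study.

```python
# Generic sketch of model merging by weight interpolation: blend the parameters
# of two checkpoints fine-tuned from the same initialization. This illustrates
# the merging idea mentioned in the UnIVAL entry; the paper's own study may
# select and weight checkpoints differently.

import torch


def interpolate_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    merged = {}
    for name, param_a in sd_a.items():
        param_b = sd_b[name]
        if torch.is_floating_point(param_a):
            merged[name] = alpha * param_a + (1.0 - alpha) * param_b
        else:
            merged[name] = param_a.clone()  # integer buffers: keep one copy as-is
    return merged


# Hypothetical usage: merge a captioning and a VQA checkpoint of the same model.
# model.load_state_dict(interpolate_state_dicts(
#     torch.load("captioning.pt"), torch.load("vqa.pt"), alpha=0.5))
```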
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
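The Kosmos-1 entry above refers to training on arbitrarily interleaved text and images. One common way to represent such a document is a single flat sequence in which each image is wrapped in sentinel markers and later swapped for visual embeddings; the sketch below shows that packing step with made-up sentinels, not Kosmos-1's actual tokenization.

```python
# Sketch of packing an interleaved text/image document into one flat stream, in
# the spirit of training multimodal LMs on interleaved corpora. The sentinels
# "<image>" / "</image>" are illustrative placeholders, not the tokens Kosmos-1
# actually uses, and `split()` stands in for a real subword tokenizer.

from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    """Placeholder for an image; in practice this carries pixels or embeddings."""
    uri: str


def pack_interleaved(doc: List[Union[str, ImageRef]]) -> List[str]:
    """Flatten mixed text spans and images into a single token-like stream."""
    stream: List[str] = []
    for item in doc:
        if isinstance(item, ImageRef):
            # During training the span between the sentinels would be filled
            # with the image's visual embeddings rather than text tokens.
            stream += ["<image>", item.uri, "</image>"]
        else:
            stream += item.split()
    return stream


doc = ["A photo of my desk:", ImageRef("desk.jpg"),
       "and the view outside:", ImageRef("window.jpg")]
print(pack_interleaved(doc))
```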
- FashionVQA: A Domain-Specific Visual Question Answering System [2.6924405243296134]
We train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images.
The accuracy of the best model surpasses the human expert level, even when answering human-generated questions.
Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language.
arXiv Detail & Related papers (2022-08-24T01:18:13Z)
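The FashionVQA entry above credits an automatically generated, domain-specific dataset but does not describe its construction. As a purely illustrative assumption, the sketch below shows one plausible recipe: instantiating natural-language question templates from structured catalog attributes.

```python
# Hypothetical sketch of generating domain-specific VQA pairs from question
# templates plus catalog attributes. This is an assumption about how such a
# dataset could be built, not FashionVQA's documented pipeline.

from typing import Dict, Iterator, Tuple

TEMPLATES = [
    ("What color is the {category} in the image?", "color"),
    ("What pattern does the {category} have?", "pattern"),
]


def generate_qa(item: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) pairs for one annotated catalog item."""
    for template, attribute in TEMPLATES:
        if attribute in item:
            yield template.format(category=item["category"]), item[attribute]


item = {"category": "dress", "color": "red", "pattern": "floral"}
for question, answer in generate_qa(item):
    print(question, "->", answer)
```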
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
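The ESPER entry above describes using reinforcement learning to align multimodal inputs with language-model generations without direct supervision. The sketch below shows the general shape of such a loop, assuming an image-text similarity score (e.g. from a CLIP-style model) as the reward and a plain REINFORCE update; the hooks passed in are placeholders, not ESPER's actual components or objective.

```python
# Shape of an RL loop that aligns image inputs with LM generations without
# paired supervision: sample a caption, score how well it matches the image
# with a similarity model, and push up the log-probability of high-reward
# samples (a plain REINFORCE update). All hooks below are placeholders; they
# are not ESPER's actual components, reward, or optimizer.


def rl_alignment_step(image,
                      sample_caption,         # image -> caption string
                      caption_logprob,        # (image, caption) -> differentiable scalar
                      image_text_similarity,  # (image, caption) -> float reward
                      optimizer,              # e.g. a PyTorch optimizer over LM params
                      baseline: float = 0.0):
    caption = sample_caption(image)
    reward = image_text_similarity(image, caption)       # no ground-truth caption needed
    advantage = reward - baseline
    loss = -advantage * caption_logprob(image, caption)  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return caption, reward
```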
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass the more relevant information downstream.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
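The last entry above mentions gates that pass the more relevant information downstream. As a minimal illustration of frame-selection gating (not the paper's exact module), the sketch below scores each frame feature against a question feature and soft-selects frames with a sigmoid gate.

```python
# Minimal PyTorch sketch of a frame-selection gate: score each frame feature
# against a question feature and scale the frame by a sigmoid gate, so that
# only question-relevant frames pass strongly to downstream layers. This is an
# illustration of the gating idea, not the paper's exact architecture.

import torch
import torch.nn as nn


class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # gate from [frame; question] features

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames: (T, dim) per-frame features; question: (dim,) question feature
        q = question.unsqueeze(0).expand_as(frames)                       # (T, dim)
        gate = torch.sigmoid(self.score(torch.cat([frames, q], dim=-1)))  # (T, 1)
        return gate * frames                                              # gated frame features


# Usage with random features: 8 frames, 256-dim.
gate = FrameSelectionGate(256)
out = gate(torch.randn(8, 256), torch.randn(256))
print(out.shape)  # torch.Size([8, 256])
```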
This list is automatically generated from the titles and abstracts of the papers on this site.