ChatterBox: Multi-round Multimodal Referring and Grounding
- URL: http://arxiv.org/abs/2401.13307v1
- Date: Wed, 24 Jan 2024 09:02:00 GMT
- Title: ChatterBox: Multi-round Multimodal Referring and Grounding
- Authors: Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye
- Abstract summary: We present a new benchmark and an efficient vision-language model for this purpose.
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
- Score: 108.9673313949746
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this study, we establish a baseline for a new task named multimodal
multi-round referring and grounding (MRG), opening up a promising direction for
instance-level multimodal dialogues. We present a new benchmark and an
efficient vision-language model for this purpose. The new benchmark, named
CB-300K, spans challenges including multi-round dialogue, complex spatial
relationships among multiple instances, and consistent reasoning, which are
beyond those shown in existing benchmarks. The proposed model, named
ChatterBox, utilizes a two-branch architecture to collaboratively handle vision
and language tasks. By tokenizing instance regions, the language branch
acquires the ability to perceive referential information. Meanwhile, ChatterBox
feeds a query embedding in the vision branch to a token receiver for visual
grounding. A two-stage optimization strategy is devised, making use of both
CB-300K and auxiliary external data to improve the model's stability and
capacity for instance-level understanding. Experiments show that ChatterBox
outperforms existing models in MRG both quantitatively and qualitatively,
paving a new path towards multimodal dialogue scenarios with complicated and
precise interactions. Code, data, and model are available at:
https://github.com/sunsmarterjie/ChatterBox.
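The abstract describes the mechanism only at a high level: instance regions are tokenized so the language branch can perceive referential information, and a query embedding from the language side is handed to a token receiver in the vision branch for visual grounding. The following is a minimal PyTorch-style sketch of that two-branch idea; every module name, shape, and dimension is an illustrative assumption, not the authors' implementation (see the repository linked above for the real model).

```python
# Minimal, hypothetical sketch of a two-branch referring-and-grounding model.
# All names and shapes are illustrative assumptions, not the ChatterBox code.
import torch
import torch.nn as nn


class RegionTokenizer(nn.Module):
    """Maps a normalized box (x1, y1, x2, y2) to a token embedding so the
    language branch can 'perceive' referential information."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:  # (B, N, 4) -> (B, N, D)
        return self.proj(boxes)


class TwoBranchGrounder(nn.Module):
    """Language branch emits a grounding query; a token receiver in the vision
    branch cross-attends over image features and regresses a box."""

    def __init__(self, dim: int = 512, vocab: int = 32000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.region_tokenizer = RegionTokenizer(dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.language_branch = nn.TransformerEncoder(layer, num_layers=2)
        self.vision_backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in patch embed
        self.token_receiver = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(dim, 4)  # predicts a normalized box for grounding

    def forward(self, image, text_ids, ref_boxes):
        # Language branch: text tokens plus tokenized instance regions.
        tokens = torch.cat([self.text_embed(text_ids), self.region_tokenizer(ref_boxes)], dim=1)
        lang = self.language_branch(tokens)
        query = lang[:, -1:, :]  # use the final token state as the grounding query (illustrative choice)

        # Vision branch: image features; the token receiver attends with the query.
        feats = self.vision_backbone(image).flatten(2).transpose(1, 2)  # (B, HW, D)
        grounded, _ = self.token_receiver(query, feats, feats)
        return self.box_head(grounded).sigmoid().squeeze(1)  # (B, 4)


# Usage: one referring expression with one referenced region -> predicted box.
model = TwoBranchGrounder()
img = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 32000, (1, 12))
refs = torch.rand(1, 1, 4)
print(model(img, text, refs).shape)  # torch.Size([1, 4])
```

The abstract's two-stage optimization (training on CB-300K together with auxiliary external data) is a training-schedule choice and is not modeled in this sketch.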
Related papers
- From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [16.733970553781887]
We propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders.
Our method exploits the high semantic similarity between embedding spaces of well-trained vision and language models.
It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple projectors.
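(A minimal sketch of this frozen-encoder-plus-projector recipe appears after this related-papers list.)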
arXiv Detail & Related papers (2024-09-28T17:57:32Z)
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, the S3 model, which achieves near-state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) open-source support for multi-round and multi-image dialogues, (2) an innovative multi-modal causal attention mechanism, and (3) data blending techniques on existing datasets to ensure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of screen-equipped virtual assistants, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework in which the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.
Our model can recognize visual features such as color and shape, as well as metadata-based features such as the price or star rating associated with a visual entity.
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- Situated and Interactive Multimodal Conversations [21.391260370502224]
We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents.
We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances) using a multimodal Wizard-of-Oz (WoZ) setup.
We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
arXiv Detail & Related papers (2020-06-02T09:02:23Z)
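Two of the entries above (the projector-alignment paper and S3) share a common recipe: keep pretrained unimodal encoders frozen and train only a lightweight projector that maps one modality's embeddings into the other's space. The sketch below illustrates that generic recipe with a symmetric contrastive loss; the encoder stand-ins, dimensions, and hyperparameters are assumptions for illustration, not the implementation of either paper.

```python
# Generic frozen-encoder + trainable-projector recipe (illustrative only):
# only `projector` receives gradient updates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Small MLP mapping frozen vision embeddings into the language space."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vision_dim, hidden), nn.GELU(), nn.Linear(hidden, text_dim))

    def forward(self, v):
        return self.net(v)


def contrastive_alignment_step(vision_encoder, text_encoder, projector, optimizer,
                               images, caption_ids, temperature=0.07):
    """One training step: both encoders stay frozen; only the projector is updated."""
    with torch.no_grad():                      # frozen unimodal encoders
        v = vision_encoder(images)             # (B, vision_dim)
        t = text_encoder(caption_ids)          # (B, text_dim)
    z_v = F.normalize(projector(v), dim=-1)    # project vision into the text space
    z_t = F.normalize(t, dim=-1)
    logits = z_v @ z_t.t() / temperature       # (B, B) image-caption similarity
    labels = torch.arange(logits.size(0))      # matched pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with stand-in "encoders" (real systems would load pretrained models).
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768)).eval()
text_encoder = nn.Sequential(nn.Embedding(1000, 512), nn.Flatten(1), nn.Linear(8 * 512, 512)).eval()
projector = Projector()
opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)
loss = contrastive_alignment_step(vision_encoder, text_encoder, projector, opt,
                                  torch.randn(4, 3, 32, 32), torch.randint(0, 1000, (4, 8)))
print(f"alignment loss: {loss:.3f}")
```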