DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever
- URL: http://arxiv.org/abs/2401.01076v2
- Date: Wed, 3 Jan 2024 02:13:29 GMT
- Title: DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever
- Authors: Zhichao Yin, Binyuan Hui, Min Yang, Fei Huang, Yongbin Li
- Abstract summary: We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
- Score: 83.33209603041013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, substantial advancements in pre-trained vision-language models have
greatly enhanced the capabilities of multi-modal dialog systems. These models
have demonstrated significant improvements when fine-tuned on downstream tasks.
However, the existing pre-trained models primarily focus on effectively
capturing the alignment between vision and language modalities, often ignoring
the intricate nature of dialog context. In this paper, we propose a
parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog
retrieval. Specifically, our approach introduces a multi-modal context prompt
generator to learn context features which are subsequently distilled into
prompts within the pre-trained vision-language model CLIP. In addition, we
introduce a domain prompt to mitigate the discrepancy with the downstream
dialog data. To facilitate various types of retrieval, we also design multiple
experts to learn mappings from CLIP outputs to the multi-modal representation
space, with each expert responsible for one specific retrieval type. Extensive
experiments show that DialCLIP achieves state-of-the-art performance on two
widely recognized benchmark datasets (i.e., PhotoChat and MMDialog) by tuning a
mere 0.04% of the total parameters. These results highlight the efficacy and
efficiency of our proposed approach, underscoring its potential to advance the
field of multi-modal dialog retrieval.
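
The abstract describes the architecture only at a high level. The following is a minimal PyTorch-style sketch of how such a design could be wired together: a frozen CLIP-like backbone, a trainable multi-modal context prompt generator, a learned domain prompt, and one lightweight expert head per retrieval type. All class names, dimensions, and the placeholder encoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): frozen backbone + trainable prompts
# and per-retrieval-type expert heads. All names and sizes are assumptions.
import torch
import torch.nn as nn


class DialCLIPSketch(nn.Module):
    def __init__(self, clip_dim=512, prompt_len=8, num_experts=2):
        super().__init__()
        # Stand-in for the frozen CLIP text/image encoders (kept frozen).
        self.frozen_encoder = nn.Linear(clip_dim, clip_dim)
        for p in self.frozen_encoder.parameters():
            p.requires_grad = False

        # Trainable multi-modal context prompt generator: distills the
        # pooled dialog-context feature into a handful of prompt vectors.
        self.context_prompt_gen = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.GELU(),
            nn.Linear(clip_dim, prompt_len * clip_dim),
        )
        # Trainable domain prompt to bridge the gap to dialog data.
        self.domain_prompt = nn.Parameter(torch.zeros(prompt_len, clip_dim))

        # One lightweight expert head per retrieval type
        # (e.g., text-response vs. image-response retrieval).
        self.experts = nn.ModuleList(
            nn.Linear(clip_dim, clip_dim) for _ in range(num_experts)
        )

    def encode_context(self, context_feat, expert_id=0):
        """Map a pooled dialog-context feature into the retrieval space."""
        b, d = context_feat.shape
        prompts = self.context_prompt_gen(context_feat).view(b, -1, d)
        prompts = prompts + self.domain_prompt  # broadcast over the batch
        # In the real model the prompts would be prepended to CLIP's input
        # tokens; here we simply pool them with the frozen feature.
        fused = self.frozen_encoder(context_feat) + prompts.mean(dim=1)
        return self.experts[expert_id](fused)


# Usage: rank candidate responses by cosine similarity to the dialog context.
model = DialCLIPSketch()
ctx = torch.randn(1, 512)     # pooled dialog-context feature (assumed given)
cands = torch.randn(10, 512)  # pooled candidate-response features
q = nn.functional.normalize(model.encode_context(ctx), dim=-1)
c = nn.functional.normalize(cands, dim=-1)
scores = q @ c.t()            # higher score = better match
print(scores.argmax(dim=-1))
```

Only the prompt generator, domain prompt, and expert heads carry gradients here, which mirrors the paper's claim of tuning a very small fraction of the total parameters while the backbone stays frozen.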
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
arXiv Detail & Related papers (2024-11-13T04:32:58Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques.
We show that speech-based multi-modal retrieval outperforms text-based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
arXiv Detail & Related papers (2024-06-13T22:55:22Z)
- POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models [28.072184039405784]
We present POEM, a visual analytics system that facilitates efficient prompt engineering for large language models (LLMs).
The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts.
arXiv Detail & Related papers (2024-06-06T08:21:30Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments on DialogGen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via
Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z) - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [96.33509740612486]
MM-REACT is a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
MM-REACT's prompt design allows language models to accept, associate, and process multimodal information.
arXiv Detail & Related papers (2023-03-20T18:31:47Z) - "Think Before You Speak": Improving Multi-Action Dialog Policy by
Planning Single-Action Dialogs [33.78889030078026]
Multi-action dialog policy (MADP) generates multiple atomic dialog actions per turn.
We propose Planning Enhanced Dialog Policy (PEDP), a novel multi-task learning framework that learns single-action dialog dynamics.
Our fully supervised learning-based method achieves a solid task success rate of 90.6%, a 3% improvement over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-25T07:55:53Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)