Application of frozen large-scale models to multimodal task-oriented
dialogue
- URL: http://arxiv.org/abs/2310.00845v1
- Date: Mon, 2 Oct 2023 01:42:28 GMT
- Title: Application of frozen large-scale models to multimodal task-oriented
dialogue
- Authors: Tatsuki Kawamoto, Takuma Suzuki, Ko Miyama, Takumi Meguro, Tomohiro
Takagi
- Abstract summary: We use the existing Large Language Models ENhanced to See Framework (LENS Framework) to test the feasibility of multimodal task-oriented dialogues.
The LENS Framework has been proposed as a method for solving computer vision tasks without additional training, using only the fixed parameters of pre-trained models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we use the existing Large Language Models ENhanced to See
Framework (LENS Framework) to test the feasibility of multimodal task-oriented
dialogues. The LENS Framework has been proposed as a method for solving computer
vision tasks without additional training, using only the fixed parameters of
pre-trained models. We used the Multimodal Dialogs (MMD) dataset, a multimodal
task-oriented dialogue benchmark from the fashion domain, and for evaluation we
used the ChatGPT-based G-EVAL, which accepts only textual modalities, with
adaptations to handle multimodal data. Compared to Transformer-based models in
previous studies, our method demonstrated an absolute improvement of 10.8% in
fluency, 8.8% in usefulness, and 5.2% in relevance and coherence. The results
show that using large-scale models with fixed parameters, rather than models
trained from scratch on the dataset, improves performance in multimodal
task-oriented dialogues. At the same time, we show that Large Language Models
(LLMs) are effective for multimodal task-oriented dialogues, which is expected
to lead to efficient applications in existing systems.
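To make the setup concrete, below is a minimal sketch of a LENS-style pipeline for this setting: frozen vision modules translate each image into text, and a frozen LLM generates the dialogue response from that text. All names here (`caption_image`, `detect_attributes`, `frozen_llm`) are illustrative stubs standing in for pre-trained models, not the paper's actual components.

```python
# Sketch of a LENS-style frozen-model pipeline for multimodal
# task-oriented dialogue. No parameters are updated anywhere; the
# vision modules only translate images into text a text-only LLM
# can consume. All three model functions are illustrative stubs.

def caption_image(image_path: str) -> str:
    """Stand-in for a frozen image captioner."""
    return "a red floral-print summer dress"

def detect_attributes(image_path: str) -> list[str]:
    """Stand-in for a frozen attribute tagger."""
    return ["red", "floral print", "knee-length", "cotton"]

def frozen_llm(prompt: str) -> str:
    """Stand-in for a frozen large language model."""
    return "We have that dress in sizes S-L. Would you like it in red?"

def respond(dialogue_history: list[str], image_path: str | None) -> str:
    # Turn the image into text so the text-only LLM can reason over it.
    visual_context = ""
    if image_path is not None:
        visual_context = (
            f"[Image] caption: {caption_image(image_path)}; "
            f"attributes: {', '.join(detect_attributes(image_path))}\n"
        )
    prompt = (
        "You are a fashion shopping assistant.\n"
        + visual_context
        + "\n".join(dialogue_history)
        + "\nAssistant:"
    )
    return frozen_llm(prompt)

print(respond(["User: Do you have this in my size?"], "dress.jpg"))
```

The same image-to-text conversion is what allows a text-only evaluator such as G-EVAL to score multimodal dialogues.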
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of such benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
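A minimal sketch of such a trainable modality projector, assuming PyTorch; the encoder output size, LLM hidden size, and number of soft tokens are illustrative, not the paper's values.

```python
# Only the projector trains; the encoder and LLM stay frozen. It maps
# one frozen-encoder embedding to a few "soft tokens" in the LLM's
# token-embedding space. All dimensions are made up for illustration.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(enc_dim, llm_dim * n_tokens)

    def forward(self, enc_emb: torch.Tensor) -> torch.Tensor:
        # enc_emb: (batch, enc_dim) from a frozen image/audio encoder.
        b = enc_emb.shape[0]
        return self.proj(enc_emb).view(b, self.n_tokens, -1)

image_emb = torch.randn(2, 1024)           # frozen image-encoder output
soft_tokens = ModalityProjector()(image_emb)
print(soft_tokens.shape)                   # torch.Size([2, 4, 4096])
# These soft tokens are concatenated with the text token embeddings
# before the frozen LLM; only `proj` receives gradient updates.
```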
arXiv Detail & Related papers (2024-06-26T12:45:43Z) - Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques.
We show that speech-based multi-modal retrieval outperforms text-based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
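As a rough illustration of the kNN-LM side of this idea, the sketch below interpolates a language model's next-token distribution with a distribution voted by nearest neighbors from a datastore; all arrays and the mixing weight are toy values, not the paper's configuration.

```python
# kNN-LM interpolation sketch: mix the LM's softmax with a
# distribution built from nearest neighbors in a datastore of
# (context embedding, next token) pairs.
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.3):
    d = np.linalg.norm(keys - query, axis=1)  # distance to each key
    nn_idx = np.argsort(d)[:k]                # k nearest neighbors
    w = np.exp(-d[nn_idx]); w /= w.sum()      # softmax over -distance
    p_knn = np.zeros(vocab_size)
    for i, idx in enumerate(nn_idx):
        p_knn[values[idx]] += w[i]            # vote for neighbor's token
    return lam * p_knn + (1 - lam) * p_lm

vocab = 10
p_lm = np.full(vocab, 1 / vocab)                  # dummy LM distribution
keys = np.random.randn(100, 16)                   # datastore embeddings
values = np.random.randint(0, vocab, size=100)    # recorded next tokens
query = np.random.randn(16)                       # current context
print(knn_lm_probs(p_lm, query, keys, values, vocab).sum())  # ~1.0
```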
arXiv Detail & Related papers (2024-06-13T22:55:22Z) - Many Hands Make Light Work: Task-Oriented Dialogue System with Module-Based Mixture-of-Experts [9.129081545049992]
Task-oriented dialogue systems have greatly benefited from pre-trained language models (PLMs).
We propose the Soft Mixture-of-Expert Task-Oriented Dialogue system (SMETOD).
SMETOD leverages an ensemble of Mixture-of-Experts (MoEs) to excel at subproblems and generate specialized outputs for task-oriented dialogues.
We extensively evaluate our model on three benchmark functionalities: intent prediction, dialogue state tracking, and dialogue response generation.
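A generic soft mixture-of-experts layer in PyTorch conveys the core idea: a router softly weights several expert networks rather than hard-selecting one, so different subproblems can lean on different experts. This is a sketch of the general technique, not SMETOD's implementation.

```python
# Soft MoE sketch: every expert contributes, weighted by the router's
# softmax (no hard top-k selection). Sizes are illustrative.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim)
        gates = torch.softmax(self.router(x), dim=-1)        # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, dim)
        return (gates.unsqueeze(-1) * outs).sum(dim=1)       # weighted mix

h = torch.randn(8, 256)
print(SoftMoE()(h).shape)  # torch.Size([8, 256])
```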
arXiv Detail & Related papers (2024-05-16T01:02:09Z) - DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
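The sketch below illustrates the general prompt-tuning pattern described here: a trainable module turns dialog context into prompt vectors that are prepended to the inputs of a frozen encoder. The `frozen_encoder` is a stand-in layer, not CLIP itself, and all dimensions are made up.

```python
# Prompt-tuning sketch: the pre-trained model stays frozen; only the
# module that produces prompt vectors from dialog context is trained.
import torch
import torch.nn as nn

class PromptTunedRetriever(nn.Module):
    def __init__(self, ctx_dim=512, clip_dim=512, n_prompts=8):
        super().__init__()
        self.n_prompts = n_prompts
        # Trainable: dialog context features -> prompt vectors.
        self.context_to_prompts = nn.Linear(ctx_dim, clip_dim * n_prompts)
        # Frozen stand-in for the pre-trained encoder.
        self.frozen_encoder = nn.Linear(clip_dim, clip_dim)
        for p in self.frozen_encoder.parameters():
            p.requires_grad = False

    def forward(self, ctx_feat, token_embs):
        # ctx_feat: (batch, ctx_dim); token_embs: (batch, seq, clip_dim)
        prompts = self.context_to_prompts(ctx_feat).view(
            ctx_feat.size(0), self.n_prompts, -1)
        x = torch.cat([prompts, token_embs], dim=1)  # prepend prompts
        return self.frozen_encoder(x).mean(dim=1)    # pooled embedding

m = PromptTunedRetriever()
print(m(torch.randn(2, 512), torch.randn(2, 16, 512)).shape)  # (2, 512)
```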
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - Unlocking the Potential of User Feedback: Leveraging Large Language
Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach, User-Guided Response Optimization (UGRO), which combines an LLM with a smaller task-oriented dialogue model.
This approach uses the LLM as an annotation-free user simulator to assess dialogue responses, combining it with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
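A minimal sketch of the LLM-as-user-simulator pattern: a frozen LLM scores candidate responses from a smaller TOD model, and the best-scored one is kept (or the scores serve as a training signal). `llm_score` is a hypothetical stub, not UGRO's actual prompt or scoring scheme.

```python
# LLM-as-user-simulator sketch: score candidates, keep the best.

def llm_score(history: list[str], candidate: str) -> float:
    """Stand-in for prompting an LLM to rate a response, e.g., 1-5."""
    return float(len(candidate) % 5) + 1  # dummy heuristic, not real

def pick_best(history: list[str], candidates: list[str]) -> str:
    scores = [llm_score(history, c) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

history = ["User: Book a table for two tonight."]
cands = ["Sure, what time?", "I cannot help.", "Booked for 7pm at Luigi's."]
print(pick_best(history, cands))
```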
arXiv Detail & Related papers (2023-06-16T13:04:56Z) - Gated Mechanism Enhanced Multi-Task Learning for Dialog Routing [28.870359916550996]
We propose the Gated Mechanism enhanced Multi-task Model (G3M).
The proposal includes a novel dialog encoder and two tailored gated mechanism modules.
Extensive experimental results on two datasets collected from real-world applications demonstrate the effectiveness of our method.
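As a generic illustration of a gated mechanism module, the sketch below blends two task representations with a learned sigmoid gate; it shows the general technique, not the G3M architecture itself.

```python
# Gated fusion sketch: a sigmoid gate decides, per dimension, how much
# information flows from each of two task representations.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) controls the blend between the two task features.
        g = torch.sigmoid(self.gate(torch.cat([h_a, h_b], dim=-1)))
        return g * h_a + (1 - g) * h_b

fused = GatedFusion()(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape)  # torch.Size([4, 128])
```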
arXiv Detail & Related papers (2023-04-07T16:51:46Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
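The recipe summarized above translates almost directly into code. The sketch below freezes a stand-in LLM block and visual encoder and trains only one linear projection plus one prepended soft token; with real multi-billion-parameter models the trainable share falls below 1%, though the tiny stand-ins here do not reproduce that ratio.

```python
# eP-ALM-style recipe sketch: freeze everything except one linear
# projection and one trainable soft token. Models are tiny stand-ins.
import torch
import torch.nn as nn

llm_dim, enc_dim = 768, 512
frozen_llm_block = nn.Linear(llm_dim, llm_dim)   # stand-in for a frozen LLM
frozen_encoder = nn.Linear(32, enc_dim)          # stand-in for a frozen ViT
for p in list(frozen_llm_block.parameters()) + list(frozen_encoder.parameters()):
    p.requires_grad = False

proj = nn.Linear(enc_dim, llm_dim)                    # the only trained layer
soft_token = nn.Parameter(torch.zeros(1, 1, llm_dim))  # one trainable token

image = torch.randn(2, 32)
text_embs = torch.randn(2, 10, llm_dim)            # frozen text embeddings
visual = proj(frozen_encoder(image)).unsqueeze(1)  # (2, 1, llm_dim)
seq = torch.cat([soft_token.expand(2, -1, -1), visual, text_embs], dim=1)
out = frozen_llm_block(seq)

trainable = sum(p.numel() for p in [*proj.parameters(), soft_token])
total = trainable + sum(p.numel() for m in (frozen_llm_block, frozen_encoder)
                        for p in m.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.2%}")
```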
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks.
To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.