Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation
- URL: http://arxiv.org/abs/2207.01823v1
- Date: Tue, 5 Jul 2022 05:54:20 GMT
- Title: Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation
- Authors: Bin Li, Yixuan Weng, Ziyu Ma, Bin Sun and Shutao Li
- Abstract summary: This paper introduces the schemes of Team LingJing's experiments in NLPCC-2022-Shared-Task-4 Multi-modal Dialogue Understanding and Generation (MDUG)
The MDUG task can be divided into two phases: multi-modal context understanding and response generation.
To fully leverage the visual information for both scene understanding and dialogue generation, we propose the scene-aware prompt for the MDUG task.
- Score: 20.693465164885325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the schemes of Team LingJing's experiments in
NLPCC-2022-Shared-Task-4 Multi-modal Dialogue Understanding and Generation
(MDUG). The MDUG task can be divided into two phases: multi-modal context
understanding and response generation. To fully leverage the visual information
for both scene understanding and dialogue generation, we propose the
scene-aware prompt for the MDUG task. Specifically, we adopt a multi-task
strategy to jointly model scene- and session-level multi-modal understanding.
Visual captions are used to capture scene information, while a fixed-type
templated prompt built from the scene- and session-aware labels is used to
further improve dialogue generation performance. Extensive experimental results
show that the proposed method achieves state-of-the-art (SOTA) performance
compared with other competitive methods, ranking 1st in all three subtasks of
this MDUG competition.
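As a rough illustration of the fixed-type templated prompt described above, the sketch below assembles scene- and session-aware labels, a visual caption, and the dialogue history into a single generation prompt; the template wording, label values, and function name are illustrative assumptions, since the abstract does not specify the exact format.
```python
# Minimal sketch of a scene-aware templated prompt; the template and labels
# are made up for illustration, not the paper's actual prompt format.

def build_scene_aware_prompt(scene_label: str,
                             session_label: str,
                             visual_caption: str,
                             dialogue_history: list[str]) -> str:
    """Assemble a fixed-type templated prompt from scene- and session-aware
    labels, a visual caption, and the dialogue context (illustrative)."""
    history = " [SEP] ".join(dialogue_history)
    return (f"Scene: {scene_label}. Session: {session_label}. "
            f"Caption: {visual_caption}. "
            f"Dialogue: {history} Response:")

# Example usage with made-up labels and caption
prompt = build_scene_aware_prompt(
    scene_label="living room",
    session_label="casual chat",
    visual_caption="two people sitting on a sofa watching TV",
    dialogue_history=["What are you watching?", "A cooking show."],
)
print(prompt)
```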
Related papers
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
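A minimal, generic sketch of the parameter-efficient prompt-tuning idea, assuming a frozen text encoder that accepts embedded sequences directly; the class names, the single-vector context generator, and the omission of the retrieval experts are simplifications, not DialCLIP's actual implementation.
```python
# Generic prompt tuning around a frozen backbone; all names are placeholders.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Learnable prompts prepended to a frozen encoder's input embeddings."""

    def __init__(self, frozen_text_encoder: nn.Module, embed_dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = frozen_text_encoder
        for p in self.encoder.parameters():        # train only the prompt parameters
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        self.context_generator = nn.Linear(embed_dim, embed_dim)  # distills context into a prompt

    def forward(self, token_embeddings: torch.Tensor, context_feature: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, dim); context_feature: (batch, dim)
        batch = token_embeddings.size(0)
        static_prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        dynamic_prompt = self.context_generator(context_feature).unsqueeze(1)
        prompted = torch.cat([static_prompts, dynamic_prompt, token_embeddings], dim=1)
        return self.encoder(prompted)              # frozen encoder processes the prompted sequence
```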
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenge of this task lies in the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- UniDU: Towards A Unified Generative Dialogue Understanding Framework [62.8474841241855]
We investigate a unified generative dialogue understanding framework, namely UniDU, to achieve information exchange among DU tasks.
We conduct experiments on ten dialogue understanding datasets, which span five fundamental tasks.
The proposed UniDU framework outperforms well-designed task-specific methods on all five tasks.
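The unified generative formulation can be pictured as casting every dialogue-understanding task into one text-to-text format, as in the sketch below; the task names and template strings are illustrative assumptions, not the actual UniDU prompts.
```python
def to_unified_format(task: str, dialogue: str) -> str:
    """Map a dialogue-understanding example to a single text-to-text prompt
    (templates are made up for illustration)."""
    templates = {
        "intent":     f"task: intent detection. dialogue: {dialogue} intent:",
        "slot":       f"task: slot filling. dialogue: {dialogue} slots:",
        "state":      f"task: dialogue state tracking. dialogue: {dialogue} state:",
        "summary":    f"task: dialogue summarization. dialogue: {dialogue} summary:",
        "completion": f"task: dialogue completion. dialogue: {dialogue} completion:",
    }
    return templates[task]

print(to_unified_format("intent", "user: book a table for two tonight"))
```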
arXiv Detail & Related papers (2022-04-10T09:32:34Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
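One generic way to picture utterance- and speaker-aware representations is to add speaker-role and utterance-index embeddings to the token embeddings before encoding, as sketched below; this is an illustrative simplification, not the paper's exact mechanism.
```python
# Simplified speaker- and utterance-aware input embeddings (illustrative only).
import torch
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    def __init__(self, model_dim: int, n_speakers: int = 2, max_utterances: int = 64):
        super().__init__()
        self.speaker_embed = nn.Embedding(n_speakers, model_dim)
        self.utterance_embed = nn.Embedding(max_utterances, model_dim)

    def forward(self, token_embeds: torch.Tensor,
                speaker_ids: torch.Tensor,
                utterance_ids: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, dim); speaker_ids / utterance_ids: (batch, seq_len)
        return token_embeds + self.speaker_embed(speaker_ids) + self.utterance_embed(utterance_ids)
```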
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
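A simplified sketch of that sequence-to-sequence formulation: video features are projected into the token-embedding space and concatenated with the text embeddings so a single pre-trained generation model can attend across both modalities; the projection module, class name, and shapes are assumptions, not the paper's exact architecture.
```python
# Combine projected video features and text embeddings into one input sequence.
import torch
import torch.nn as nn

class VideoGroundedInput(nn.Module):
    """Project video features into the token-embedding space and prepend them
    to the text sequence (simplified sketch)."""

    def __init__(self, video_dim: int, model_dim: int):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, model_dim)

    def forward(self, video_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, n_frames, video_dim); text_embeds: (batch, seq_len, model_dim)
        video_tokens = self.video_proj(video_feats)
        return torch.cat([video_tokens, text_embeds], dim=1)  # one structured input sequence
```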
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content.
We propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
- Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog [39.01822389691502]
We propose a universal multimodal transformer and introduce the multi-task learning method to learn joint representations among different modalities.
Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task.
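The multi-task learning setup can be sketched as one shared backbone trained with a response-generation loss plus an auxiliary objective; the video-text matching head and the loss weight below are illustrative assumptions, not the paper's exact tasks.
```python
# Shared hidden states feed a generation head and an assumed auxiliary head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    def __init__(self, model_dim: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(model_dim, vocab_size)  # next-token prediction
        self.match_head = nn.Linear(model_dim, 2)        # auxiliary objective (assumed)

    def forward(self, hidden: torch.Tensor, pooled: torch.Tensor):
        # hidden: (batch, seq_len, dim); pooled: (batch, dim)
        return self.lm_head(hidden), self.match_head(pooled)

def joint_loss(lm_logits, labels, match_logits, match_labels, aux_weight: float = 0.5):
    """Generation loss plus a weighted auxiliary loss (weight is an assumption)."""
    gen = F.cross_entropy(lm_logits.transpose(1, 2), labels, ignore_index=-100)
    aux = F.cross_entropy(match_logits, match_labels)
    return gen + aux_weight * aux
```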
arXiv Detail & Related papers (2020-02-01T07:50:43Z)
- Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System [13.687071779732285]
We propose a multi-step joint-modality attention network (JMAN) based on recurrent neural network (RNN) to reason on videos.
Our model achieves relative improvements of 12.1% and 22.4% over the baseline on ROUGE-L and CIDEr scores, respectively.
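The multi-step joint-modality attention can be pictured as repeatedly refining a dialogue query by attending to video and caption features, as in the sketch below; the attention modules, GRU update, and step count are simplified placeholders rather than the actual JMAN architecture.
```python
# Multi-step attention over two modalities with a recurrent query update (illustrative).
import torch
import torch.nn as nn

class MultiStepJointAttention(nn.Module):
    def __init__(self, dim: int, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        self.video_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update = nn.GRUCell(2 * dim, dim)

    def forward(self, query: torch.Tensor, video: torch.Tensor, caption: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim); video / caption: (batch, len, dim); dim must be divisible by 4
        for _ in range(self.n_steps):
            q = query.unsqueeze(1)
            v_ctx, _ = self.video_attn(q, video, video)      # attend to video features
            t_ctx, _ = self.text_attn(q, caption, caption)   # attend to caption features
            joint = torch.cat([v_ctx.squeeze(1), t_ctx.squeeze(1)], dim=-1)
            query = self.update(joint, query)                # refine the query each step
        return query
```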
arXiv Detail & Related papers (2020-01-17T09:18:00Z)