Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
- URL: http://arxiv.org/abs/2001.06206v1
- Date: Fri, 17 Jan 2020 09:18:00 GMT
- Title: Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
- Authors: Yun-Wei Chu, Kuan-Yen Lin, Chao-Chun Hsu, Lun-Wei Ku
- Abstract summary: We propose a multi-step joint-modality attention network (JMAN), based on a recurrent neural network (RNN), to reason on videos.
Our model achieves relative improvements of 12.1% in ROUGE-L and 22.4% in CIDEr over the baseline.
- Score: 13.687071779732285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding dynamic scenes and dialogue contexts in order to converse with
users has been challenging for multimodal dialogue systems. The 8th Dialog
System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog
(AVSD) task, which contains multiple modalities including audio, vision, and
language, to evaluate how well dialogue systems understand different modalities
and respond to users. In this paper, we propose a multi-step joint-modality
attention network (JMAN), based on a recurrent neural network (RNN), to reason
on videos. Our model performs a multi-step attention mechanism and jointly
considers both visual and textual representations in each reasoning step to
better integrate information from the two modalities. Compared to the baseline
released by the AVSD organizers, our model achieves relative improvements of
12.1% in ROUGE-L and 22.4% in CIDEr.
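For readers who want a concrete picture of the reasoning loop described above, below is a minimal PyTorch sketch of a multi-step attention module that jointly attends over visual and textual features and refines a question-derived query with a GRU cell. The dimensions, the additive scoring function, and the GRU-based query update are illustrative assumptions, not the exact JMAN architecture.

```python
# A minimal sketch (PyTorch) of one possible multi-step joint-modality attention
# loop. Dimensions, the number of steps, and the fusion rule are illustrative
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn


class JointModalityAttention(nn.Module):
    def __init__(self, dim: int = 512, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.vis_attn = nn.Linear(dim * 2, 1)   # scores visual frames against the query
        self.txt_attn = nn.Linear(dim * 2, 1)   # scores textual tokens against the query
        self.update = nn.GRUCell(dim * 2, dim)  # refines the query after each step

    def attend(self, scorer, feats, query):
        # feats: (batch, n, dim); query: (batch, dim)
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        weights = torch.softmax(scorer(torch.cat([feats, q], dim=-1)).squeeze(-1), dim=-1)
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)  # attention-weighted sum

    def forward(self, visual, textual, question):
        # visual: (batch, frames, dim), textual: (batch, tokens, dim), question: (batch, dim)
        query = question
        for _ in range(self.steps):
            v_ctx = self.attend(self.vis_attn, visual, query)   # attended visual summary
            t_ctx = self.attend(self.txt_attn, textual, query)  # attended textual summary
            query = self.update(torch.cat([v_ctx, t_ctx], dim=-1), query)
        return query  # joint representation fed to the answer decoder


# Toy usage with random features standing in for video and dialogue encodings.
model = JointModalityAttention()
out = model(torch.randn(2, 40, 512), torch.randn(2, 60, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```

Stacking several such steps lets the query progressively focus on the frames and tokens most relevant to the question before answer decoding.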
Related papers
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
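As a rough illustration of the hierarchical-encoder idea mentioned above, the sketch below encodes each utterance with pre-trained DialoGPT (via Hugging Face Transformers) and summarizes the per-utterance vectors with a GRU. The mean-pooling, the GRU, and the microsoft/DialoGPT-small checkpoint are assumptions for illustration; the visual branch is omitted.

```python
# Hedged sketch of a hierarchical dialogue-context encoder on top of DialoGPT.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
utterance_encoder = AutoModel.from_pretrained("microsoft/DialoGPT-small")
dialogue_encoder = nn.GRU(input_size=768, hidden_size=768, batch_first=True)

def encode_dialogue(utterances):
    """Encode a list of utterances into a single dialogue-context vector."""
    vectors = []
    for text in utterances:
        batch = tokenizer(text, return_tensors="pt")
        hidden = utterance_encoder(**batch).last_hidden_state  # (1, tokens, 768)
        vectors.append(hidden.mean(dim=1))                     # mean-pool one utterance
    turns = torch.stack(vectors, dim=1)                        # (1, turns, 768)
    _, context = dialogue_encoder(turns)                       # final GRU state over turns
    return context.squeeze(0)                                  # (1, 768)

context = encode_dialogue(["what is the man doing?", "he is cooking in the kitchen"])
print(context.shape)  # torch.Size([1, 768])
```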
arXiv Detail & Related papers (2023-05-27T10:06:03Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation [20.693465164885325]
This paper introduces the schemes used in Team LingJing's experiments for NLPCC-2022-Shared-Task-4, Multi-modal Dialogue Understanding and Generation (MDUG).
The MDUG task can be divided into two phases: multi-modal context understanding and response generation.
To fully leverage the visual information for both scene understanding and dialogue generation, we propose the scene-aware prompt for the MDUG task.
arXiv Detail & Related papers (2022-07-05T05:54:20Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
VDTN combines both object-level and segment-level visual features when modeling these dependencies.
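A loose sketch of the feature-combination idea: object-level and segment-level video features are tagged with modality-type embeddings, concatenated with dialogue embeddings, and contextualized by a Transformer encoder. The feature sizes, the type-embedding scheme, and the single classification head are simplifying assumptions, not the actual VDTN state decoder.

```python
# Simplified multimodal fusion for dialogue-state prediction (illustrative only).
import torch
import torch.nn as nn

dim, num_slot_values = 256, 50
type_embed = nn.Embedding(3, dim)  # 0 = object feature, 1 = segment feature, 2 = text token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
state_head = nn.Linear(dim, num_slot_values)  # toy stand-in for state generation

def predict_state(obj_feats, seg_feats, text_embeds):
    # Each input: (batch, length, dim); tag each span with its modality, then concatenate.
    parts, types = [obj_feats, seg_feats, text_embeds], [0, 1, 2]
    tagged = [p + type_embed(torch.full(p.shape[:2], t, dtype=torch.long)) for p, t in zip(parts, types)]
    fused = encoder(torch.cat(tagged, dim=1))   # contextualize across modalities
    return state_head(fused.mean(dim=1))        # (batch, num_slot_values) logits

logits = predict_state(torch.randn(2, 10, dim), torch.randn(2, 8, dim), torch.randn(2, 20, dim))
print(logits.shape)  # torch.Size([2, 50])
```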
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- Overview of the Ninth Dialog System Technology Challenge: DSTC9 [111.35889309106359]
The Ninth Dialog System Technology Challenge (DSTC-9) focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems.
This paper describes the task definition, provided datasets, baselines and evaluation set-up for each track.
arXiv Detail & Related papers (2020-11-12T16:43:10Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
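One way to read this sequence-to-sequence formulation is sketched below: video features are projected into the embedding space of a pre-trained GPT-2 and prepended to the dialogue tokens, so the language model can be fine-tuned over the joint sequence. The projection layer and the plain gpt2 checkpoint are assumptions for illustration, not the authors' released code.

```python
# Rough illustration: fuse projected video features with GPT-2 token embeddings.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
video_proj = nn.Linear(2048, model.config.n_embd)  # e.g. map CNN video features to 768-d

def forward_multimodal(video_feats, dialogue_text):
    # video_feats: (1, frames, 2048); dialogue_text: dialogue history plus current question
    tokens = tokenizer(dialogue_text, return_tensors="pt")
    text_embeds = model.transformer.wte(tokens["input_ids"])       # (1, tokens, 768)
    fused = torch.cat([video_proj(video_feats), text_embeds], dim=1)
    return model(inputs_embeds=fused).logits                        # next-token logits

logits = forward_multimodal(torch.randn(1, 16, 2048), "what happens at the end of the video?")
print(logits.shape)  # (1, 16 + num_text_tokens, vocab_size)
```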
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of input video.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
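A minimal sketch of the dot-product attention mentioned above, assuming a single-head scaled dot-product formulation in which dialogue-text queries attend over non-text (audio/visual) features of the video; the shapes are illustrative.

```python
# Scaled dot-product attention: text queries attend over video features.
import math
import torch

def dot_product_attention(text_q, video_kv):
    # text_q: (batch, text_len, dim); video_kv: (batch, video_len, dim)
    scores = torch.bmm(text_q, video_kv.transpose(1, 2)) / math.sqrt(text_q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to each frame
    return torch.bmm(weights, video_kv)       # video-informed text representation

fused = dot_product_attention(torch.randn(2, 30, 512), torch.randn(2, 80, 512))
print(fused.shape)  # torch.Size([2, 30, 512])
```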
arXiv Detail & Related papers (2020-02-25T06:41:07Z)
- Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog [39.01822389691502]
We propose a universal multimodal transformer and introduce the multi-task learning method to learn joint representations among different modalities.
Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task.
arXiv Detail & Related papers (2020-02-01T07:50:43Z)