Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge
- URL: http://arxiv.org/abs/2002.10695v1
- Date: Tue, 25 Feb 2020 06:41:07 GMT
- Title: Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge
- Authors: Hung Le, Nancy F. Chen
- Abstract summary: We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of input video.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
- Score: 48.905496060794114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Scene-Aware Dialog (AVSD) is an extension from Video Question
Answering (QA) whereby the dialogue agent is required to generate natural
language responses to address user queries and carry on conversations. This is
a challenging task, as the input comprises video features of multiple
modalities, including text, visual, and audio. The agent also needs to learn
semantic dependencies among user utterances and system responses to make
coherent conversations with humans. In this work, we describe our submission to
the AVSD track of the 8th Dialogue System Technology Challenge. We adopt
dot-product attention to combine text and non-text features of input video. We
further enhance the generation capability of the dialogue agent by adopting
pointer networks to point to tokens from multiple source sequences in each
generation step. Our systems achieve high performance in automatic metrics and
obtain 5th and 6th place in human evaluation among all submissions.
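As a concrete illustration of the first idea, here is a minimal PyTorch sketch of scaled dot-product attention in which text states attend over non-text video features (e.g. visual or audio streams). The module, names, and tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductVideoAttention(nn.Module):
    """Hypothetical module: fuses text states with video (visual/audio)
    features via scaled dot-product attention."""
    def __init__(self, d_text: int, d_video: int, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_text, d_model)   # queries from text
        self.k_proj = nn.Linear(d_video, d_model)  # keys from video
        self.v_proj = nn.Linear(d_video, d_model)  # values from video

    def forward(self, text_states, video_feats):
        # text_states: (batch, T_text, d_text)
        # video_feats: (batch, T_video, d_video), e.g. I3D/VGGish features
        q = self.q_proj(text_states)
        k = self.k_proj(video_feats)
        v = self.v_proj(video_feats)
        # Scaled dot-product scores over video time steps.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)
        return weights @ v  # video-aware text representations
```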
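The pointer mechanism can likewise be sketched as a mixture of one generation distribution and per-source copy distributions, pointed at tokens from multiple source sequences (e.g. the caption, dialogue history, and current query). This is a hedged sketch of the general pointer-network idea, not the submission's actual implementation; all names and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourcePointer(nn.Module):
    """Hypothetical pointer head: at each decoding step, mixes a
    vocabulary distribution with copy distributions over several
    source sequences."""
    def __init__(self, d_model: int, vocab_size: int, num_sources: int):
        super().__init__()
        self.vocab_out = nn.Linear(d_model, vocab_size)
        # One mixture weight per source, plus one for generation.
        self.gate = nn.Linear(d_model, num_sources + 1)

    def forward(self, dec_state, src_attn, src_token_ids):
        # dec_state:     (batch, d_model) decoder state at this step
        # src_attn:      list of (batch, T_i) attention weights per source
        # src_token_ids: list of (batch, T_i) token ids per source
        mix = F.softmax(self.gate(dec_state), dim=-1)        # (batch, S+1)
        p_vocab = F.softmax(self.vocab_out(dec_state), dim=-1)
        p_final = mix[:, 0:1] * p_vocab                      # generation part
        for i, (attn, ids) in enumerate(zip(src_attn, src_token_ids)):
            # Scatter each source's attention mass onto its token ids.
            copy = torch.zeros_like(p_vocab).scatter_add(1, ids, attn)
            p_final = p_final + mix[:, i + 1:i + 2] * copy
        return p_final  # final distribution over the vocabulary
```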
Related papers
- A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
arXiv Detail & Related papers (2023-05-27T10:06:03Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing additional cues from different modalities during information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [16.436557991074068]
This paper interprets the Audio-Visual Scene-Aware Dialog (AVSD) task from an open-domain Question Answering (QA) point of view.
The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention.
Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
arXiv Detail & Related papers (2020-07-20T06:23:18Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task (a minimal sketch of this formulation appears after this list).
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog [39.01822389691502]
We propose a universal multimodal transformer and introduce the multi-task learning method to learn joint representations among different modalities.
Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task.
arXiv Detail & Related papers (2020-02-01T07:50:43Z)
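To make the sequence-to-sequence formulation from the Video-Grounded Dialogues entry above concrete, here is a minimal sketch (an assumption-laden illustration, not that paper's code): video features are projected into a pretrained GPT-2's embedding space and prepended to the token embeddings, so the combined sequence can be fine-tuned with a standard language-modeling loss. The class name, feature dimension, and projection layer are hypothetical.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class VideoGroundedLM(nn.Module):
    """Hypothetical wrapper: casts video-grounded dialogue as one
    input sequence for a pretrained language model."""
    def __init__(self, d_video: int = 2048):  # assumed video feature size
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        d_model = self.lm.config.n_embd
        self.video_proj = nn.Linear(d_video, d_model)  # video -> embed space

    def forward(self, video_feats, input_ids, labels=None):
        # video_feats: (batch, T_video, d_video); input_ids: (batch, T_text)
        tok_emb = self.lm.transformer.wte(input_ids)
        vid_emb = self.video_proj(video_feats)
        inputs_embeds = torch.cat([vid_emb, tok_emb], dim=1)
        if labels is not None:
            # Mask the video prefix out of the LM loss via ignore_index -100.
            pad = torch.full(video_feats.shape[:2], -100,
                             dtype=torch.long, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)
```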