Bridging Text and Video: A Universal Multimodal Transformer for
Video-Audio Scene-Aware Dialog
- URL: http://arxiv.org/abs/2002.00163v1
- Date: Sat, 1 Feb 2020 07:50:43 GMT
- Title: Bridging Text and Video: A Universal Multimodal Transformer for
Video-Audio Scene-Aware Dialog
- Authors: Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Cheng Niu, Jie Zhou
- Abstract summary: We propose a universal multimodal transformer and introduce a multi-task learning method to learn joint representations across different modalities.
Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task.
- Score: 39.01822389691502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when
chatting about a given video, which is organized as a track of the 8th Dialog
System Technology Challenge (DSTC8). To solve the task, we propose a universal
multimodal transformer and introduce a multi-task learning method to learn
joint representations across different modalities as well as to generate
informative and fluent responses. Our method extends a pre-trained natural
language generation model to the multimodal dialogue generation task. Our system
achieves the best performance in both objective and subjective evaluations in
the challenge.
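The abstract does not spell out the architecture or the auxiliary objectives, but the general recipe can be sketched roughly as follows: video and audio features are projected into the token-embedding space of a text generator and processed jointly with the dialogue text, and a response-generation loss is trained together with an auxiliary objective. This is a minimal sketch only; the feature dimensions, the matching task, and the loss weight below are illustrative assumptions, not details from the paper.
```python
import torch
import torch.nn as nn

class UniversalMultimodalTransformer(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_layers=6, n_heads=12,
                 video_dim=2048, audio_dim=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)   # project I3D-like video features (assumed size)
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project VGGish-like audio features (assumed size)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # stands in for a pre-trained generator
        self.lm_head = nn.Linear(d_model, vocab_size)     # response generation (next-token prediction)
        self.match_head = nn.Linear(d_model, 2)           # auxiliary video-text matching task (assumed)

    def forward(self, video_feats, audio_feats, text_ids):
        # One joint sequence over all modalities: [video | audio | text].
        # Causal masking is omitted here for brevity.
        x = torch.cat([self.video_proj(video_feats),
                       self.audio_proj(audio_feats),
                       self.token_emb(text_ids)], dim=1)
        h = self.backbone(x)
        lm_logits = self.lm_head(h[:, -text_ids.size(1):])  # logits only over the text positions
        match_logits = self.match_head(h[:, 0])             # first position pooled for matching
        return lm_logits, match_logits

model = UniversalMultimodalTransformer()
video = torch.randn(2, 8, 2048)                 # 8 sampled video segments per clip
audio = torch.randn(2, 8, 128)                  # 8 audio segments per clip
text = torch.randint(0, 50257, (2, 20))         # caption + dialogue history + response tokens
match_labels = torch.tensor([1, 0])             # whether the clip and the text actually match
lm_logits, match_logits = model(video, audio, text)
# Multi-task objective: generation loss plus the auxiliary loss (0.5 is an illustrative weight).
gen_loss = nn.functional.cross_entropy(lm_logits[:, :-1].reshape(-1, 50257), text[:, 1:].reshape(-1))
loss = gen_loss + 0.5 * nn.functional.cross_entropy(match_logits, match_labels)
```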
Related papers
- Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task lies in the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation [20.693465164885325]
This paper introduces the methods used in Team LingJing's experiments for NLPCC-2022-Shared-Task-4, Multi-modal Dialogue Understanding and Generation (MDUG).
The MDUG task can be divided into two phases: multi-modal context understanding and response generation.
To fully leverage the visual information for both scene understanding and dialogue generation, we propose the scene-aware prompt for the MDUG task.
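As a rough illustration of the general idea (the team's exact prompt template is not given in this summary, so the format below is an assumption), scene labels predicted from the video can be verbalized and prepended to the dialogue context before response generation:
```python
def build_scene_aware_prompt(scene_labels, dialogue_turns):
    """scene_labels: tags predicted from the video frames (hypothetical source);
    dialogue_turns: list of (speaker, utterance) pairs."""
    scene_part = "Scene: " + ", ".join(scene_labels)
    history = "\n".join(f"{spk}: {utt}" for spk, utt in dialogue_turns)
    return f"{scene_part}\n{history}\nAssistant:"

prompt = build_scene_aware_prompt(
    ["kitchen", "two people", "cooking"],
    [("User", "What are they doing?"),
     ("Assistant", "They are preparing dinner."),
     ("User", "Where does this take place?")],
)
# The prompt is then fed to a pre-trained generation model to produce the reply.
```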
arXiv Detail & Related papers (2022-07-05T05:54:20Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
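A minimal sketch of the general idea, assuming the common recipe of adding extra embeddings on top of the token embeddings so the encoder can distinguish turns and speaker roles; the paper's actual mechanism may differ:
```python
import torch
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, max_utts=32, n_speakers=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)
        self.utterance = nn.Embedding(max_utts, d_model)   # which turn a token belongs to
        self.speaker = nn.Embedding(n_speakers, d_model)   # which speaker role uttered that turn

    def forward(self, token_ids, utt_ids, speaker_ids):
        return self.token(token_ids) + self.utterance(utt_ids) + self.speaker(speaker_ids)

emb = SpeakerAwareEmbedding()
tokens = torch.randint(0, 30522, (1, 12))
utts = torch.tensor([[0]*4 + [1]*4 + [2]*4])      # three utterances, four tokens each
speakers = torch.tensor([[0]*4 + [1]*4 + [0]*4])  # alternating speaker roles
x = emb(tokens, utts, speakers)                   # (1, 12, 768), then fed into a pre-trained encoder
```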
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
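A minimal sketch of this formulation, assuming GPT-2 via the Hugging Face transformers library and pre-extracted video features; the projection size and example text are illustrative, not the authors' released code:
```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
video_proj = nn.Linear(2048, model.config.n_embd)    # hypothetical video feature size

video_feats = torch.randn(1, 8, 2048)                # 8 pre-extracted video segments
text = "Q: what is the man doing? A: he is reading a book"
ids = tokenizer(text, return_tensors="pt").input_ids

video_embeds = video_proj(video_feats)
text_embeds = model.transformer.wte(ids)             # reuse GPT-2's own token embeddings
inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)

# Ignore the loss on video positions; supervise only the text continuation.
labels = torch.cat([torch.full((1, 8), -100), ids], dim=1)
out = model(inputs_embeds=inputs_embeds, labels=labels)
out.loss.backward()                                   # fine-tune jointly over both modalities
```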
arXiv Detail & Related papers (2020-06-27T08:24:26Z) - Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine the text features with the non-text features of the input video.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
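The fusion step can be sketched generically as scaled dot-product attention in which text states query the video features; the dimensions and the additive fusion below are illustrative, not the submission's exact design:
```python
import math
import torch

def dot_product_attention(queries, keys, values):
    # queries: (B, Lt, D) text states; keys/values: (B, Lv, D) video features
    scores = queries @ keys.transpose(1, 2) / math.sqrt(queries.size(-1))
    weights = scores.softmax(dim=-1)          # how much each token attends to each video segment
    return weights @ values                   # (B, Lt, D) video context per text token

text_states = torch.randn(2, 20, 512)
video_feats = torch.randn(2, 8, 512)
video_context = dot_product_attention(text_states, video_feats, video_feats)
fused = text_states + video_context           # simple additive fusion of the two modalities
```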
arXiv Detail & Related papers (2020-02-25T06:41:07Z) - Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue
System [13.687071779732285]
We propose a multi-step joint-modality attention network (JMAN), based on a recurrent neural network (RNN), to reason over videos.
Our model achieves relative improvements of 12.1% and 22.4% over the baseline on the ROUGE-L and CIDEr scores, respectively.
arXiv Detail & Related papers (2020-01-17T09:18:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.