DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style
Word Generator
- URL: http://arxiv.org/abs/2004.08299v1
- Date: Wed, 1 Apr 2020 07:10:08 GMT
- Title: DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style
Word Generator
- Authors: Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung
Bui and Kyomin Jung
- Abstract summary: Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response for a question with a given scene, video, audio, and the history of previous turns in the dialog.
Existing systems for this task employ transformer- or recurrent neural network-based architectures within the encoder-decoder framework.
We propose a Multimodal Semantic Transformer Network. It employs a transformer-based architecture with an attention-based word embedding layer that generates words by querying word embeddings.
- Score: 61.70748716353692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response
for a question with a given scene, video, audio, and the history of previous
turns in the dialog. Existing systems for this task employ transformer- or
recurrent neural network-based architectures within the encoder-decoder
framework.
Although these techniques show strong performance on this task, they have
significant limitations: the model easily overfits, merely memorizing
grammatical patterns, and it tends to follow the prior distribution of the
vocabulary in the dataset. To alleviate these problems, we propose a Multimodal
Semantic Transformer Network. It employs a transformer-based architecture with
an attention-based word embedding layer that generates words by querying word
embeddings. With this design, our model keeps considering the meaning of the
words at the generation stage. The empirical results demonstrate the
superiority of our proposed model, which outperforms most previous works on
the AVSD task.
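Neither the abstract nor this page includes code, but the retrieval-style generator it describes can be sketched concretely. The following is a minimal PyTorch illustration, under the assumption that the decoder state is projected into a query and scored against a shared word-embedding table; all class and variable names are invented for the sketch and do not come from the authors' implementation.

```python
import torch
import torch.nn as nn


class RetrievalStyleGenerator(nn.Module):
    """Sketch of an attention-based word generator: the decoder state
    queries the word-embedding table, so the output logits are similarity
    scores against word embeddings rather than a free softmax projection.
    (Hypothetical reconstruction, not the authors' released code.)"""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # shared with the input side
        self.query_proj = nn.Linear(d_model, d_model)       # maps decoder state to a query

    def forward(self, decoder_state: torch.Tensor) -> torch.Tensor:
        # decoder_state: (batch, seq_len, d_model)
        query = self.query_proj(decoder_state)
        # Dot-product attention against every word embedding -> (batch, seq_len, vocab)
        logits = query @ self.embedding.weight.t()
        return logits  # feed to softmax / cross-entropy as usual


# Usage: scores over the vocabulary for a dummy decoder state.
gen = RetrievalStyleGenerator(vocab_size=1000, d_model=64)
state = torch.randn(2, 5, 64)
print(gen(state).shape)  # torch.Size([2, 5, 1000])
```

Because the logits are inner products with the embedding matrix, words close in embedding space receive similar scores, which is the sense in which the model "keeps considering the meaning of the words" while generating.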
Related papers
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis,
and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
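As a rough illustration of the TID/LID conditioning described in this summary, special ID tokens can simply be prepended to the discrete token sequence. The ID tables, token values, and offset below are invented for the sketch and are not VioLA's actual vocabulary layout.

```python
# Hypothetical sketch: conditioning a decoder-only model on task/language IDs
# by prepending special tokens. All IDs and offsets are illustrative only.
TASK_IDS = {"asr": 0, "tts": 1, "s2t_translation": 2}
LANG_IDS = {"en": 10, "zh": 11}
SPECIAL_OFFSET = 50_000  # assume codec/text tokens live below this offset


def build_input(task: str, lang: str, speech_tokens: list[int]) -> list[int]:
    """Prepend a task-ID and a language-ID token to discrete speech tokens."""
    tid = SPECIAL_OFFSET + TASK_IDS[task]
    lid = SPECIAL_OFFSET + LANG_IDS[lang]
    return [tid, lid] + speech_tokens


print(build_input("asr", "en", [101, 102, 103]))
# [50000, 50010, 101, 102, 103]
```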
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
- Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with their root forms.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word transformation type.
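The three-way action scheme in this summary resembles a pointer/copy-style mixture. The sketch below shows one plausible way to combine the three distributions at a decoding step; all tensor shapes and names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


# Hypothetical sketch of a three-way action mixture: at each step the decoder
# distributes probability over (i) generating a question word, (ii) copying a
# source word, (iii) emitting a word-transformation type.
def step_distribution(gen_logits, copy_logits, transform_logits, action_logits):
    # action_logits: (batch, 3) -> mixture weights over the three actions
    w = F.softmax(action_logits, dim=-1)
    p_gen = F.softmax(gen_logits, dim=-1)          # (batch, |question words|)
    p_copy = F.softmax(copy_logits, dim=-1)        # (batch, source length)
    p_trans = F.softmax(transform_logits, dim=-1)  # (batch, |transform types|)
    # Concatenate the three weighted distributions into one output space.
    return torch.cat(
        [w[:, 0:1] * p_gen, w[:, 1:2] * p_copy, w[:, 2:3] * p_trans], dim=-1
    )


probs = step_distribution(
    torch.randn(2, 20), torch.randn(2, 15), torch.randn(2, 8), torch.randn(2, 3)
)
print(probs.shape, probs.sum(dim=-1))  # (2, 43); each row sums to 1
```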
arXiv Detail & Related papers (2023-01-01T13:08:11Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
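A unified multimodal transformer of the kind this summary describes can be approximated by projecting both modalities into a shared token space. The sketch below is an assumed, simplified reading; the patch dimension, layer counts, and names are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn


class UnifiedMultimodalEncoder(nn.Module):
    """Minimal sketch: project image patches and instruction tokens into one
    embedding space and run a single transformer over the concatenation."""

    def __init__(self, d_model=64, vocab=1000, patch_dim=192):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, instruction_ids):
        # patches: (batch, n_patches, patch_dim); instruction_ids: (batch, n_words)
        tokens = torch.cat(
            [self.patch_proj(patches), self.text_emb(instruction_ids)], dim=1
        )
        return self.encoder(tokens)


model = UnifiedMultimodalEncoder()
out = model(torch.randn(1, 9, 192), torch.randint(0, 1000, (1, 7)))
print(out.shape)  # torch.Size([1, 16, 64])
```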
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation [80.45816053153722]
DialogVED introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses.
We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation.
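DialogVED's latent variables are not specified in detail here, but a standard way to inject a continuous latent variable into an encoder-decoder is the reparameterization trick. The sketch below shows only that generic mechanism; module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LatentBridge(nn.Module):
    """Illustrative sketch (not DialogVED's code): sample a continuous latent
    z from the encoder summary and feed it to the decoder, with a KL term."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.to_decoder = nn.Linear(d_latent, d_model)

    def forward(self, enc_summary: torch.Tensor):
        mu, logvar = self.to_mu(enc_summary), self.to_logvar(enc_summary)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return self.to_decoder(z), kl.mean()  # decoder conditioning + KL loss


bridge = LatentBridge(d_model=64, d_latent=16)
cond, kl = bridge(torch.randn(4, 64))
print(cond.shape, float(kl))  # torch.Size([4, 64]) and a scalar KL value
```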
arXiv Detail & Related papers (2022-04-27T16:18:15Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Hierarchical Transformer for Task Oriented Dialog Systems [11.743662338418867]
We show how a standard transformer can be morphed into any hierarchical encoder, including HRED- and HIBERT-like models, by using specially designed attention masks and positional encodings.
We demonstrate that hierarchical encoding helps achieve better natural language understanding of the contexts in transformer-based models for task-oriented dialog systems.
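The attention-mask idea can be made concrete with a toy example. The function below builds a mask that confines self-attention to tokens of the same utterance, one plausible ingredient of such a hierarchical scheme; the paper's exact masks and positional encodings are not reproduced here.

```python
import torch


# Rough sketch (illustrative, not the paper's exact masks): restricting
# self-attention to tokens within the same utterance makes a flat transformer
# behave like the word-level encoders of an HRED-style model.
def utterance_level_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """segment_ids: (seq_len,) utterance index per token.
    Returns a (seq_len, seq_len) boolean mask; True = attention allowed."""
    return segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)


seg = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])  # three utterances
print(utterance_level_mask(seg).int())
# Block-diagonal: tokens attend only inside their own utterance.
```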
arXiv Detail & Related papers (2020-10-24T10:08:52Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)