DAM: Deliberation, Abandon and Memory Networks for Generating Detailed
and Non-repetitive Responses in Visual Dialogue
- URL: http://arxiv.org/abs/2007.03310v1
- Date: Tue, 7 Jul 2020 09:49:47 GMT
- Title: DAM: Deliberation, Abandon and Memory Networks for Generating Detailed
and Non-repetitive Responses in Visual Dialogue
- Authors: Xiaoze Jiang, Jing Yu, Yajing Sun, Zengchang Qin, Zihao Zhu, Yue Hu,
Qi Wu
- Abstract summary: We propose a novel generative decoding architecture to generate high-quality responses.
In this architecture, word generation is decomposed into a series of attention-based information selection steps.
The responses contain more detailed and non-repetitive descriptions while maintaining semantic accuracy.
- Score: 29.330198609132207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Visual Dialogue task requires an agent to engage in a
conversation with a human about an image. The ability to generate detailed and
non-repetitive responses is crucial for the agent to achieve human-like
conversation. In this
paper, we propose a novel generative decoding architecture to generate
high-quality responses, which moves away from decoding the whole encoded
semantics towards the design that advocates both transparency and flexibility.
In this architecture, word generation is decomposed into a series of
attention-based information selection steps, performed by the novel recurrent
Deliberation, Abandon and Memory (DAM) module. Each DAM module performs an
adaptive combination of the response-level semantics captured from the encoder
and the word-level semantics specifically selected for generating each word.
Therefore, the responses contain more detailed and non-repetitive descriptions
while maintaining semantic accuracy. Furthermore, DAM can flexibly cooperate
with existing visual dialogue encoders and adapt to their structures by
constraining its information selection mode. We apply DAM
to three typical encoders and verify the performance on the VisDial v1.0
dataset. Experimental results show that the proposed models achieve new
state-of-the-art performance with high-quality responses. The code is available
at https://github.com/JXZe/DAM.
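To make the decoding design concrete, here is a minimal PyTorch sketch of one recurrent DAM-style step: Deliberation attends over encoder states to select word-level semantics, Abandon gates away what is irrelevant to the current word, and Memory adaptively fuses the result with the response-level vector from the encoder. The class name, gating choices, and dimensions are illustrative assumptions rather than the authors' exact implementation; see the repository above for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAMStepSketch(nn.Module):
    """Illustrative sketch of one Deliberation-Abandon-Memory decoding step.

    An approximation of the design described in the abstract, not the
    authors' released implementation (see https://github.com/JXZe/DAM).
    """

    def __init__(self, hidden_dim):
        super().__init__()
        # Deliberation: scores each encoder state against the decoder state.
        self.attn = nn.Linear(hidden_dim * 2, 1)
        # Abandon: gate that filters the selected word-level information.
        self.abandon_gate = nn.Linear(hidden_dim * 2, hidden_dim)
        # Memory: gate that fuses word-level and response-level semantics.
        self.memory_gate = nn.Linear(hidden_dim * 2, hidden_dim)
        self.rnn_cell = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, prev_word_emb, hidden, enc_states, response_vec):
        # prev_word_emb: (B, H) embedding of the previously generated word
        # hidden:        (B, H) decoder hidden state
        # enc_states:    (B, T, H) per-token encoder states (word-level semantics)
        # response_vec:  (B, H) response-level semantics from the encoder

        # Deliberation: attention-based selection of word-level semantics.
        query = hidden.unsqueeze(1).expand_as(enc_states)              # (B, T, H)
        scores = self.attn(torch.cat([enc_states, query], dim=-1))     # (B, T, 1)
        selected = (F.softmax(scores, dim=1) * enc_states).sum(dim=1)  # (B, H)

        # Abandon: drop information irrelevant to generating the current word.
        a = torch.sigmoid(self.abandon_gate(torch.cat([selected, hidden], dim=-1)))
        kept = a * selected

        # Memory: adaptive combination with the response-level semantics.
        m = torch.sigmoid(self.memory_gate(torch.cat([kept, response_vec], dim=-1)))
        fused = m * kept + (1.0 - m) * response_vec

        # Recurrent update; a full decoder projects new_hidden to the vocabulary.
        new_hidden = self.rnn_cell(prev_word_emb + fused, hidden)
        return new_hidden, fused
```

A full decoder would run this step once per generated word and project `new_hidden` onto the vocabulary; the point of the sketch is the per-word, attention-based information selection, as opposed to decoding every word from a single fixed encoding.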
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Disentangled Variational Autoencoder for Emotion Recognition in Conversations [14.92924920489251]
We propose a VAD-disentangled Variational AutoEncoder (VAD-VAE) for Emotion Recognition in Conversations (ERC).
VAD-VAE disentangles three affect representations Valence-Arousal-Dominance (VAD) from the latent space.
Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets.
arXiv Detail & Related papers (2023-05-23T13:50:06Z)
- Dialogue Meaning Representation for Task-Oriented Dialogue Systems [51.91615150842267]
We propose Dialogue Meaning Representation (DMR), a flexible and easily extendable representation for task-oriented dialogue.
Our representation contains a set of nodes and edges with an inheritance hierarchy to represent rich compositional semantics and task-specific concepts.
We propose two evaluation tasks to evaluate different machine learning based dialogue models, and further propose a novel coreference resolution model GNNCoref for the graph-based coreference resolution task.
arXiv Detail & Related papers (2022-04-23T04:17:55Z)
- Do Encoder Representations of Generative Dialogue Models Encode Sufficient Information about the Task? [41.36218215755317]
We show that evaluating generated text with human or automatic metrics is not sufficient to properly assess the soundness of a dialogue model's language understanding.
We propose a set of probe tasks to evaluate encoder representation of different language encoders commonly used in dialogue models.
arXiv Detail & Related papers (2021-06-20T04:52:37Z)
- Question Answering Infused Pre-training of General-Purpose Contextualized Representations [70.62967781515127]
We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations.
We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model (a generic sketch of this setup appears after this list).
We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection.
arXiv Detail & Related papers (2021-06-15T14:45:15Z)
- A Template-guided Hybrid Pointer Network for Knowledge-based Task-oriented Dialogue Systems [15.654119998970499]
We propose a template-guided hybrid pointer network for the knowledge-based task-oriented dialogue system.
We design a memory pointer network model with a gating mechanism to fully exploit the semantic correlation between the retrieved answers and the ground-truth response.
arXiv Detail & Related papers (2021-06-10T15:49:26Z)
- Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
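As a side note on the "Question Answering Infused Pre-training" entry above: the bi-encoder-matches-cross-encoder training it describes is a standard distillation pattern, and a generic sketch may help. The toy encoders, the in-batch scoring, and the KL objective below are assumptions for illustration, not the paper's actual models or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBiEncoder(nn.Module):
    """Toy bi-encoder: questions and passages are encoded independently,
    so passage vectors can be pre-computed and indexed offline."""

    def __init__(self, vocab_size=30000, dim=128):
        super().__init__()
        # EmbeddingBag (mean pooling) stands in for a real text encoder.
        self.q_enc = nn.EmbeddingBag(vocab_size, dim)
        self.p_enc = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, q_tokens, p_tokens):
        q = self.q_enc(q_tokens)   # (B, D)
        p = self.p_enc(p_tokens)   # (B, D)
        return q @ p.T             # (B, B) in-batch question-passage scores

def distill_step(student, optimizer, q_tokens, p_tokens, teacher_scores, tau=2.0):
    """One training step: match the student's in-batch score distribution
    to a (precomputed) cross-encoder teacher's scores of shape (B, B)."""
    student_scores = student(q_tokens, p_tokens)
    loss = F.kl_div(
        F.log_softmax(student_scores / tau, dim=-1),
        F.softmax(teacher_scores / tau, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random toy data:
model = ToyBiEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
q = torch.randint(0, 30000, (4, 12))   # 4 questions, 12 tokens each
p = torch.randint(0, 30000, (4, 64))   # 4 passages, 64 tokens each
teacher = torch.randn(4, 4)            # stand-in cross-encoder scores
print(distill_step(model, opt, q, p, teacher))
```

Because the bi-encoder encodes passages independently of questions, passage vectors can be pre-computed once and reused, which is what makes the student cheap at inference time while inheriting the cross-encoder's accuracy during training.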
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.