Multimodal Dialogue State Tracking
- URL: http://arxiv.org/abs/2206.07898v1
- Date: Thu, 16 Jun 2022 03:18:42 GMT
- Title: Multimodal Dialogue State Tracking
- Authors: Hung Le, Nancy F. Chen, Steven C.H. Hoi
- Abstract summary: Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
- Score: 97.25466640240619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designed for tracking user goals in dialogues, a dialogue state tracker is an
essential component in a dialogue system. However, the research of dialogue
state tracking has largely been limited to unimodality, in which slots and slot
values are limited by knowledge domains (e.g. restaurant domain with slots of
restaurant name and price range) and are defined by specific database schema.
In this paper, we propose to extend the definition of dialogue state tracking
to multimodality. Specifically, we introduce a novel dialogue state tracking
task to track the information of visual objects that are mentioned in
video-grounded dialogues. Each new dialogue utterance may introduce a new video
segment, new visual objects, or new object attributes, and a state tracker is
required to update these information slots accordingly. We created a new
synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer
Network (VDTN), for this task. VDTN combines both object-level features and
segment-level features and learns contextual dependencies between videos and
dialogues to generate multimodal dialogue states. We optimized VDTN for a state
generation task as well as a self-supervised video understanding task which
recovers video segment or object representations. Finally, we trained VDTN to
use the decoded states in a response prediction task. Together with
comprehensive ablation and qualitative analysis, we discovered interesting
insights towards building more capable multimodal dialogue systems.
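The abstract describes the multimodal dialogue state only at the slot level: each new utterance may add or revise a video segment, visual objects, and object attributes, and the tracker must update the corresponding slots. Below is a minimal sketch of what such a state container could look like, assuming illustrative names (MultimodalDialogueState, ObjectState, update) and a (start_frame, end_frame) segment encoding that are not taken from the paper.

```python
# Minimal sketch of a multimodal dialogue state: each turn may introduce a new
# video segment, new visual objects, or new object attributes, and the tracker
# updates the corresponding slots. Class and field names are illustrative
# assumptions, not the authors' implementation.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ObjectState:
    """Slots tracked for a single visual object mentioned in the dialogue."""
    object_id: int
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. {"color": "red"}


@dataclass
class MultimodalDialogueState:
    """Dialogue state extended with visual slots (video segment + objects)."""
    segment: Optional[Tuple[int, int]] = None  # assumed (start_frame, end_frame)
    objects: Dict[int, ObjectState] = field(default_factory=dict)

    def update(self,
               segment: Optional[Tuple[int, int]] = None,
               objects: Optional[Dict[int, Dict[str, str]]] = None) -> None:
        """Apply the slot updates introduced by one dialogue turn."""
        if segment is not None:                 # a new video segment is referenced
            self.segment = segment
        for obj_id, attrs in (objects or {}).items():
            state = self.objects.setdefault(obj_id, ObjectState(obj_id))
            state.attributes.update(attrs)      # new or revised object attributes


# Example: two turns of a video-grounded dialogue
state = MultimodalDialogueState()
state.update(segment=(0, 120), objects={3: {"color": "red", "shape": "cube"}})
state.update(objects={3: {"action": "sliding"}, 7: {"color": "blue"}})
print(state)
```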
Related papers
- OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST).
arXiv Detail & Related papers (2024-02-20T17:00:59Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
arXiv Detail & Related papers (2023-05-27T10:06:03Z)
- Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking [5.816391291790977]
Dialogue state tracking (DST) aims to track human-machine interactions and generate state representations for managing the dialogue.
We formulate and incorporate dialogue acts, and leverage recent advances in machine reading comprehension to predict both categorical and non-categorical types of slots for dialogue state tracking.
arXiv Detail & Related papers (2022-08-04T05:18:30Z)
- Beyond the Granularity: Multi-Perspective Dialogue Collaborative Selection for Dialogue State Tracking [18.172993687706708]
In dialogue state tracking, dialogue history is a crucial resource, and its utilization varies between models.
We propose DiCoS-DST to dynamically select the relevant dialogue contents corresponding to each slot for state updating.
Our approach achieves new state-of-the-art performance on MultiWOZ 2.1 and MultiWOZ 2.2, and superior performance on multiple mainstream benchmark datasets.
arXiv Detail & Related papers (2022-05-20T10:08:45Z)
- Back to the Future: Bidirectional Information Decoupling Network for Multi-turn Dialogue Modeling [80.51094098799736]
We propose the Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets for different downstream tasks demonstrate the universality and effectiveness of BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- UniConv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues [101.96097419995556]
"UniConv" is a novel unified neural architecture for end-to-end conversational systems in task-oriented dialogues.
We conduct comprehensive experiments in dialogue state tracking, context-to-text, and end-to-end settings on the MultiWOZ 2.1 benchmark.
arXiv Detail & Related papers (2020-04-29T16:28:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.