Multimodal Dialogue State Tracking By QA Approach with Data Augmentation
- URL: http://arxiv.org/abs/2007.09903v1
- Date: Mon, 20 Jul 2020 06:23:18 GMT
- Title: Multimodal Dialogue State Tracking By QA Approach with Data Augmentation
- Authors: Xiangyang Mou, Brandyn Sigouin, Ian Steenstra, Hui Su
- Abstract summary: This paper interprets the Audio-Video Scene-Aware Dialogue (AVSD) task from an open-domain Question Answering (QA) point of view.
The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention.
Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
- Score: 16.436557991074068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a more challenging state tracking task, Audio-Video Scene-Aware
Dialogue (AVSD), has been attracting increasing attention among researchers. Unlike
purely text-based dialogue state tracking, a dialogue in AVSD contains a sequence of
question-answer pairs about a video, and answering the final question requires
additional understanding of that video. This paper interprets the AVSD task from an
open-domain Question Answering (QA) point of view and proposes a multimodal
open-domain QA system to deal with the problem. The proposed QA system uses a common
encoder-decoder framework with multimodal fusion and attention. Teacher forcing is
applied to train the natural language generator. We also propose a new data
augmentation approach specifically under the QA assumption. Our experiments show that
our model and techniques bring significant improvements over the baseline model on the
DSTC7-AVSD dataset and demonstrate the potential of our data augmentation techniques.
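As an illustration of the architecture described in the abstract, the sketch below shows a minimal PyTorch encoder-decoder generator with attention-based multimodal fusion, trained with teacher forcing. It is not the authors' released code; the module choices (GRU encoder/decoder, mean-pooled fused context) and feature dimensions are assumptions made for clarity.

```python
# Minimal sketch (assumptions: GRU encoder/decoder, mean-pooled fused context,
# 2048-d per-segment video features). Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalQAGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=256, video_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.fusion_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, question_ids, video_feats, answer_ids):
        # Encode the question / dialogue history.
        q_enc, _ = self.text_encoder(self.embed(question_ids))
        # Multimodal fusion: text states attend over projected video features.
        v = self.video_proj(video_feats)
        fused, _ = self.fusion_attn(q_enc, v, v)
        # Teacher forcing: ground-truth answer tokens (shifted right) are fed
        # to the decoder, conditioned on the fused context as its initial state.
        dec_in = self.embed(answer_ids[:, :-1])
        ctx = fused.mean(dim=1).unsqueeze(0)          # (1, B, d_model)
        dec_out, _ = self.decoder(dec_in, ctx.contiguous())
        logits = self.out(dec_out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               answer_ids[:, 1:].reshape(-1))

# Toy usage with random tensors standing in for real AVSD inputs.
model = MultimodalQAGenerator(vocab_size=10000)
questions = torch.randint(0, 10000, (2, 12))   # tokenized dialogue history + question
video = torch.randn(2, 40, 2048)               # per-segment video features
answers = torch.randint(0, 10000, (2, 15))     # ground-truth answer tokens
loss = model(questions, video, answers)
loss.backward()
```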
Related papers
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level video features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z) - Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual
Transformers with Joint Student-Teacher Learning [70.56330507503867]
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track.
This paper introduces a new task that includes temporal reasoning, along with our new extension of the AVSD dataset for DSTC10.
arXiv Detail & Related papers (2021-10-13T17:24:16Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of the input video (sketched below).
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
arXiv Detail & Related papers (2020-02-25T06:41:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.