Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual
Transformers with Joint Student-Teacher Learning
- URL: http://arxiv.org/abs/2110.06894v1
- Date: Wed, 13 Oct 2021 17:24:16 GMT
- Title: Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual
Transformers with Joint Student-Teacher Learning
- Authors: Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim
K. Marks, Jonathan Le Roux, Chiori Hori
- Abstract summary: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track.
This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10.
- Score: 70.56330507503867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD)
task, collected an AVSD dataset, developed AVSD technologies, and hosted an
AVSD challenge track at both the 7th and 8th Dialog System Technology
Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems
relied heavily on human-generated descriptions of the video content, which were
available in the datasets but would be unavailable in real-world applications.
To promote further advancements for real-world applications, we proposed a
third AVSD challenge, at DSTC10, with two modifications: 1) the human-created
description is unavailable at inference time, and 2) systems must demonstrate
temporal reasoning by finding evidence from the video to support each answer.
This paper introduces the new task that includes temporal reasoning and our new
extension of the AVSD dataset for DSTC10, for which we collected
human-generated temporal reasoning data. We also introduce a baseline system
built using an AV-transformer, which we released along with the new dataset.
Finally, this paper introduces a new system that extends our baseline system
with attentional multimodal fusion, joint student-teacher learning (JSTL), and
model combination techniques, achieving state-of-the-art performances on the
AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal
reasoning methods for AVSD: one attention-based, and one based on a time-domain
region proposal network.
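The abstract names attentional multimodal fusion and joint student-teacher learning (JSTL) as the key extensions over the AV-transformer baseline. The sketch below (PyTorch) illustrates one common way such components are realized: cross-attention between audio and visual token sequences followed by learned modality weighting, and a distillation-style loss in which a student that never sees the human-created description is pulled toward a teacher that does. All module names, signatures, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of (1) attentional multimodal fusion and (2) a joint
# student-teacher (JSTL) objective. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalFusion(nn.Module):
    """Fuse audio and visual token sequences with cross-attention,
    then pool them with learned modality-level attention weights."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.av_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.va_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.modality_gate = nn.Linear(d_model, 1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, D), video: (B, Tv, D)
        a2v, _ = self.av_attn(audio, video, video)   # audio attends to video
        v2a, _ = self.va_attn(video, audio, audio)   # video attends to audio
        # Pool each modality over time, then weight the two pooled vectors.
        pooled = torch.stack([a2v.mean(dim=1), v2a.mean(dim=1)], dim=1)   # (B, 2, D)
        weights = torch.softmax(self.modality_gate(pooled), dim=1)        # (B, 2, 1)
        return (weights * pooled).sum(dim=1)                              # (B, D)

def jstl_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    """Cross-entropy on the answer tokens plus a KL term pulling the
    description-free student toward the description-conditioned teacher."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```

For the temporal reasoning requirement (finding video evidence that supports each answer), the abstract mentions a time-domain region proposal network. The following is a minimal sketch of one plausible 1D, anchor-free formulation under the same assumptions: it is a generic illustration of the technique family, not the authors' method.

```python
class TimeDomainRPN(nn.Module):
    """From a sequence of fused audio-visual features, predict per-step
    evidence scores and offsets to the evidence-segment boundaries."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.backbone = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.score_head = nn.Conv1d(d_model, 1, kernel_size=1)   # evidence score per step
        self.bound_head = nn.Conv1d(d_model, 2, kernel_size=1)   # (start, end) offsets

    def forward(self, fused: torch.Tensor):
        # fused: (B, T, D) sequence of fused audio-visual features
        x = F.relu(self.backbone(fused.transpose(1, 2)))          # (B, D, T)
        scores = self.score_head(x).squeeze(1)                    # (B, T)
        bounds = F.relu(self.bound_head(x)).transpose(1, 2)       # (B, T, 2)
        return scores, bounds
```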
Related papers
- TCG CREST System Description for the Second DISPLACE Challenge [19.387615374726444]
We describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024.
Our contributions were dedicated to Track 1 for SD and Track 2 for LD in multilingual and multi-speaker scenarios.
arXiv Detail & Related papers (2024-09-16T05:13:34Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios [61.74042680711718]
We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge.
This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices.
The goal is for participants to devise a single system that can generalize across different array geometries.
arXiv Detail & Related papers (2023-06-23T18:49:20Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims to quickly locate the most attractive objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations [20.619819743960868]
We apply Transformer-based video features, which capture temporally and spatially global representations more efficiently than CNN-based features.
Our model achieves a subjective score close to that of human answers in DSTC10.
arXiv Detail & Related papers (2022-02-21T04:09:32Z)
- Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [16.436557991074068]
This paper interprets the Audio-Video Scene-Aware Dialogue (AVSD) task from an open-domain Question Answering (QA) point of view.
The proposed QA system uses a common encoder-decoder framework with multimodal fusion and attention.
Our experiments show that our model and techniques bring significant improvements over the baseline model on the DSTC7-AVSD dataset.
arXiv Detail & Related papers (2020-07-20T06:23:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.