VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic
Understanding with Scene and Topic Transitions
- URL: http://arxiv.org/abs/2305.18756v1
- Date: Tue, 30 May 2023 05:40:37 GMT
- Title: VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic
Understanding with Scene and Topic Transitions
- Authors: Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang,
and Dongyan Zhao
- Abstract summary: Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
- Score: 47.94531693056304
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video-grounded dialogue understanding is a challenging problem that requires
machines to perceive, parse, and reason over situated semantics extracted from
weakly aligned video and dialogues. Most existing benchmarks treat both
modalities the same as a frame-independent visual understanding task, while
neglecting the intrinsic attributes in multimodal dialogues, such as scene and
topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe
dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding
dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for
video-grounded dialogue understanding: scene segmentation and topic
segmentation, and one benchmark for video-grounded dialogue generation.
Comprehensive experiments are performed on these benchmarks to demonstrate the
importance of multimodal information and segments in video-grounded dialogue
understanding and generation.
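To make the two segmentation benchmarks concrete, the sketch below frames scene and topic segmentation as binary boundary prediction over a sequence of video-aligned dialogue turns. It is a minimal illustration, not the authors' code; the `Turn` fields and the `segmentation_labels` helper are hypothetical names chosen for the example.

```python
# Minimal sketch (not the VSTAR authors' code): scene/topic segmentation framed as
# binary boundary prediction over video-aligned dialogue turns.
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    clip_id: str           # hypothetical id of the video clip aligned to this utterance
    utterance: str         # dialogue text for this turn
    scene_boundary: bool   # True if a new scene starts at this turn
    topic_boundary: bool   # True if a new topic starts at this turn

def segmentation_labels(turns: List[Turn], kind: str = "scene") -> List[int]:
    """Collapse an annotated dialogue clip into the 0/1 boundary sequence a model must predict."""
    key = "scene_boundary" if kind == "scene" else "topic_boundary"
    return [int(getattr(t, key)) for t in turns]

# Toy example: the scene changes at turn 2; the topic changes at turns 2 and 3.
dialogue = [
    Turn("c0", "Did you see the news?",             False, False),
    Turn("c0", "Yes, hard to believe.",             False, False),
    Turn("c1", "Anyway, dinner is ready.",          True,  True),
    Turn("c1", "Speaking of which, call your mom.", False, True),
]
print(segmentation_labels(dialogue, "scene"))  # [0, 0, 1, 0]
print(segmentation_labels(dialogue, "topic"))  # [0, 0, 1, 1]
```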
Related papers
- OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for
Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST).
arXiv Detail & Related papers (2024-02-20T17:00:59Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- Unsupervised Dialogue Topic Segmentation with Topic-aware Utterance Representation [51.22712675266523]
Dialogue Topic Segmentation (DTS) plays an essential role in a variety of dialogue modeling tasks.
We propose a novel unsupervised DTS framework, which learns topic-aware utterance representations from unlabeled dialogue data (a toy sketch of this segmentation idea appears after this list).
arXiv Detail & Related papers (2023-05-04T11:35:23Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
Video-Dialogue Transformer Network (VDTN) combines both object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose Compositional Counterfactual Contrastive Learning, a novel approach that develops contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
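As referenced in the topic-segmentation entry above, the intuition shared by several of these papers is that a topic boundary tends to fall where adjacent utterances are least similar. The toy sketch below uses bag-of-words cosine similarity as a stand-in for learned topic-aware utterance representations; it is a hypothetical illustration, not the method of any listed paper, and the threshold is arbitrary.

```python
# Toy sketch: similarity-based dialogue topic segmentation, with bag-of-words vectors
# standing in for learned topic-aware utterance representations (hypothetical example,
# not the implementation of any paper listed above).
import math
from collections import Counter
from typing import List

def bow(utterance: str) -> Counter:
    """Bag-of-words 'embedding' of an utterance."""
    return Counter(utterance.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_boundaries(utterances: List[str], threshold: float = 0.3) -> List[int]:
    """Return turn indices where a new topic is assumed to start (a similarity dip before them)."""
    sims = [cosine(bow(utterances[i - 1]), bow(utterances[i])) for i in range(1, len(utterances))]
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

dialogue = [
    "did you watch the football match last night",
    "yes the football match went to penalties",
    "by the way the rent is due tomorrow",
    "i will transfer the rent tonight",
]
print(topic_boundaries(dialogue))  # [2]: the third turn opens a new topic
```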