Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
- URL: http://arxiv.org/abs/2506.09953v1
- Date: Wed, 11 Jun 2025 17:23:35 GMT
- Title: Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
- Authors: Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
- Abstract summary: We introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. A conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information.
- Score: 3.0758169771529693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
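To make the setup concrete, the sketch below shows one way an OKCV-style example could be represented in code: a dialogue attached to a single video, whose turns are grounded in specific video segments but whose answers may require knowledge that is not visible in the video. The field names (video_id, segment_start, requires_external_knowledge, and so on) are illustrative assumptions rather than the released schema; the GitHub repository above documents the actual format.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DialogueTurn:
    """One interleaved question/answer turn in a dialogue over a video."""
    question: str
    answer: str
    # Video segment (in seconds) the question is visually grounded in; hypothetical fields.
    segment_start: Optional[float] = None
    segment_end: Optional[float] = None
    # Whether answering needs knowledge not present in the video; hypothetical flag.
    requires_external_knowledge: bool = True

@dataclass
class OKCVDialogue:
    """One human-annotated dialogue for a single video (illustrative schema only)."""
    video_id: str
    turns: List[DialogueTurn] = field(default_factory=list)

    def history(self, up_to: int) -> List[Tuple[str, str]]:
        """Return the conversational context preceding turn index `up_to`."""
        return [(t.question, t.answer) for t in self.turns[:up_to]]

# A model answering turn i would condition on the video, the dialogue history
# returned by history(i), and whatever external knowledge it can retrieve.
dialogue = OKCVDialogue(
    video_id="example_video",
    turns=[
        DialogueTurn("What instrument is the performer playing?", "A cello."),
        DialogueTurn("Who composed the piece being played?", "It is one of Bach's cello suites."),
    ],
)
context = dialogue.history(up_to=1)  # [("What instrument is the performer playing?", "A cello.")]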
Related papers
- InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models [123.1441379479263]
We build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round.
For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and language models (e.g., GPT-3).
arXiv Detail & Related papers (2023-12-21T00:44:45Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines both object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog [12.034554338597067]
We propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK).
In our model, the external knowledge is represented with sentence-level facts and graph-level facts.
On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features.
arXiv Detail & Related papers (2022-04-10T13:12:10Z)
- Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues [73.04906599884868]
We propose a novel framework, Reasoning Paths in Dialogue Context (PDC).
The PDC model discovers information flows among dialogue turns through a semantic graph constructed from the lexical components of each question and answer.
Our model sequentially processes both visual and textual information through this reasoning path and the propagated features are used to generate the answer.
arXiv Detail & Related papers (2021-03-01T07:39:26Z)
- DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue [30.930757279692163]
A video-grounded dialogue system is required to understand both dialogue and video.
Existing benchmarks do not have enough annotations to help analyze dialogue systems.
We present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues.
arXiv Detail & Related papers (2021-01-01T03:20:22Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields new state-of-the-art results, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.