Video Dialog as Conversation about Objects Living in Space-Time
- URL: http://arxiv.org/abs/2207.03656v1
- Date: Fri, 8 Jul 2022 02:34:38 GMT
- Title: Video Dialog as Conversation about Objects Living in Space-Time
- Authors: Hoang-Anh Pham, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran
- Abstract summary: We present a new object-centric framework for video dialog, dubbed COST, that supports neural reasoning.
COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions.
We evaluate COST on the DSTC7 and DSTC8 benchmarks, demonstrating its competitiveness against state-of-the-art methods.
- Score: 35.54055886856042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It would be a technological feat to be able to create a system that can hold
a meaningful conversation with humans about what they watch. A setup toward
that goal is presented as a video dialog task, where the system is asked to
generate natural utterances in response to a question in an ongoing dialog. The
task poses great visual, linguistic, and reasoning challenges that cannot be
easily overcome without an appropriate representation scheme over video and
dialog that supports high-level reasoning. To tackle these challenges we
present a new object-centric framework for video dialog, dubbed COST, which
stands for Conversation about Objects in Space-Time and supports neural
reasoning. Here, dynamic space-time visual content in videos is first parsed
into object trajectories. Given this video abstraction, COST maintains and
tracks object-associated dialog states, which are updated upon receiving new
questions. Object interactions are dynamically and conditionally inferred for
each question, and these interactions serve as the basis for relational
reasoning among the objects. COST also maintains a history of previous
answers, which allows retrieval of relevant object-centric information to
enrich the answer-forming process. Language production then proceeds in a
step-wise manner, taking into account the context of the current utterance,
the existing dialog, and the current question. We evaluate COST on the DSTC7
and DSTC8 benchmarks, demonstrating its competitiveness against
state-of-the-art methods.
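The abstract describes a concrete processing loop: parse the video into object trajectories, update per-object dialog states on each new question, infer question-conditioned object interactions for relational reasoning, and decode the answer step-wise. As an illustration only, here is a minimal PyTorch sketch of how such a loop could be wired; the class name COSTSketch, all dimensions, the GRU-based state update, and the mean-pooling choices are assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class COSTSketch(nn.Module):
    """Hypothetical sketch of a COST-style pipeline: per-object dialog states
    are updated on each question, object interactions are inferred with
    attention, and the pooled object context seeds step-wise decoding."""

    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # One GRU cell updates each object's dialog state from the question.
        self.state_update = nn.GRUCell(d_model, d_model)
        # Multi-head self-attention stands in for interaction inference.
        self.interaction = nn.MultiheadAttention(d_model, num_heads=4,
                                                 batch_first=True)
        # A GRU decoder produces the answer token by token.
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.readout = nn.Linear(d_model, vocab_size)

    def forward(self, object_tracks, question, object_states, answer_embeds):
        # object_tracks: (B, N, T, d) object trajectories from a video parser.
        # question:      (B, d) encoding of the current question.
        # object_states: (B, N, d) dialog states carried across turns.
        # answer_embeds: (B, L, d) embeddings of the answer tokens so far.
        B, N, T, d = object_tracks.shape
        # Summarize each trajectory over time (mean pooling as a placeholder).
        obj = object_tracks.mean(dim=2)                        # (B, N, d)
        # Update every object's dialog state with the new question.
        q = question.unsqueeze(1).expand(B, N, d)
        new_states = self.state_update(
            q.reshape(B * N, d), object_states.reshape(B * N, d)
        ).view(B, N, d)
        # Infer object-object interactions from the updated states.
        ctx, _ = self.interaction(new_states + obj, new_states + obj,
                                  new_states + obj)
        # Decode step-wise, seeding the decoder with pooled object context.
        h0 = ctx.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        out, _ = self.decoder(answer_embeds, h0)
        return self.readout(out), new_states   # token logits + carried states


# Toy usage: 2 videos, 5 tracked objects, 10 frames, 7 answer tokens so far.
model = COSTSketch()
logits, states = model(
    torch.randn(2, 5, 10, 256),   # object trajectories
    torch.randn(2, 256),          # encoded question
    torch.zeros(2, 5, 256),       # initial per-object dialog states
    torch.randn(2, 7, 256),       # answer embeddings decoded so far
)
```

The sketch collapses details the abstract attributes to COST (question-conditioned interaction inference, retrieval over the history of previous answers) into single attention and pooling steps; it is meant only to make the object-centric state-tracking idea concrete.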
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations [3.784841749866846]
We introduce the Multi-round Dialogue State Tracking model (MDST).
MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations.
Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting.
arXiv Detail & Related papers (2024-08-13T08:36:15Z) - OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for
Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog that operates over a multi-modal, attention-based dialog state tracker.
It maintains a global dialog state based on the outputs of an Object State Tracker (OST) and a Language State Tracker (LST).
arXiv Detail & Related papers (2024-02-20T17:00:59Z) - Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z) - Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions over audio recordings, and to explore the plausibility of providing additional cues from different modalities to aid information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - Unified Questioner Transformer for Descriptive Question Generation in
Goal-Oriented Visual Dialogue [0.0]
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture called the Unified Questioner Transformer (UniQer).
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
arXiv Detail & Related papers (2021-06-29T16:36:34Z) - Hierarchical Object-oriented Spatio-Temporal Reasoning for Video
Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z) - ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally.
Experiments show that the proposed method significantly improves dialogue quality by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)