OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for
Video-Grounded Dialog
- URL: http://arxiv.org/abs/2402.13146v1
- Date: Tue, 20 Feb 2024 17:00:59 GMT
- Title: OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for
Video-Grounded Dialog
- Authors: Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
- Abstract summary: OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST).
- Score: 10.290057801577662
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Object Language Video Transformer (OLViT) - a novel model for
video dialog operating over a multi-modal attention-based dialog state tracker.
Existing video dialog models struggle with questions requiring both spatial and
temporal localization within videos, long-term temporal reasoning, and accurate
object tracking across multiple dialog turns. OLViT addresses these challenges
by maintaining a global dialog state based on the output of an Object State
Tracker (OST) and a Language State Tracker (LST): while the OST attends to the
most important objects within the video, the LST keeps track of the most
important linguistic co-references to previous dialog turns. In stark contrast
to previous works, our approach is generic by nature and is therefore capable
of learning continuous multi-modal dialog state representations of the most
relevant objects and rounds. As a result, these representations can be seamlessly integrated
into Large Language Models (LLMs) and offer high flexibility in dealing with
different datasets and tasks. Evaluations on the challenging DVD (response
classification) and SIMMC 2.1 (response generation) datasets show that OLViT
achieves new state-of-the-art performance across both datasets.
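The abstract describes the OST/LST design only at a high level. As a rough illustration, the sketch below shows one plausible way such a pair of attention-based state trackers could be wired up in PyTorch; the class names, dimensions, learned-query attention, and concatenation-based fusion are assumptions made for illustration, not the authors' actual OLViT implementation.

```python
# Minimal sketch (not the paper's implementation): an Object State Tracker (OST)
# attends over video object features while a Language State Tracker (LST) attends
# over embeddings of previous dialog turns; their outputs are fused into a global
# dialog state. All names, sizes, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionStateTracker(nn.Module):
    """Cross-attends a single learned state query over a set of input embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.state_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, num_items, dim) -> state: (batch, dim)
        query = self.state_query.expand(inputs.size(0), -1, -1)
        state, _ = self.attn(query, inputs, inputs)
        return state.squeeze(1)


class GlobalDialogState(nn.Module):
    """Fuses an object-level state and a language-level state into one global dialog state."""

    def __init__(self, dim: int):
        super().__init__()
        self.ost = AttentionStateTracker(dim)  # attends over video object features
        self.lst = AttentionStateTracker(dim)  # attends over previous dialog-turn embeddings
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, object_feats: torch.Tensor, turn_embeds: torch.Tensor) -> torch.Tensor:
        object_state = self.ost(object_feats)    # (batch, dim)
        language_state = self.lst(turn_embeds)   # (batch, dim)
        return self.fuse(torch.cat([object_state, language_state], dim=-1))


# Usage with random tensors standing in for detected objects and dialog turns.
model = GlobalDialogState(dim=256)
object_feats = torch.randn(2, 36, 256)  # e.g. 36 object embeddings per video clip
turn_embeds = torch.randn(2, 5, 256)    # e.g. 5 previous dialog-turn embeddings
global_state = model(object_feats, turn_embeds)  # shape: (2, 256)
```

In this sketch, each tracker cross-attends a learned query over its inputs (detected-object embeddings for the OST, previous-turn embeddings for the LST), mirroring the idea of selecting the most relevant objects and rounds; the resulting global state vector could then condition a response classifier or an LLM decoder.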
Related papers
- Grounding is All You Need? Dual Temporal Grounding for Video Dialog [48.3411605700214]
This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD).
It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions.
It also filters video content accordingly and grounds responses in both video and dialog contexts.
arXiv Detail & Related papers (2024-10-08T07:48:34Z)
- Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations [3.784841749866846]
We introduce the Multi-round Dialogue State Tracking model (MDST).
MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations.
Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting.
arXiv Detail & Related papers (2024-08-13T08:36:15Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- Back to the Future: Bidirectional Information Decoupling Network for Multi-turn Dialogue Modeling [80.51094098799736]
We propose Bidirectional Information Decoupling Network (BiDeN) as a universal dialogue encoder.
BiDeN explicitly incorporates both past and future contexts and can be generalized to a wide range of dialogue-related tasks.
Experimental results on datasets of different downstream tasks demonstrate the universality and effectiveness of our BiDeN.
arXiv Detail & Related papers (2022-04-18T03:51:46Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial- and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)