Grounding is All You Need? Dual Temporal Grounding for Video Dialog
- URL: http://arxiv.org/abs/2410.05767v2
- Date: Thu, 14 Nov 2024 11:27:41 GMT
- Title: Grounding is All You Need? Dual Temporal Grounding for Video Dialog
- Authors: You Qin, Wei Ji, Xinze Lan, Hao Fei, Xun Yang, Dan Guo, Roger Zimmermann, Lizi Liao,
- Abstract summary: This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD)
It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions.
It also filters video content accordingly, and grounding responses in both video and dialog contexts.
- Score: 48.3411605700214
- License:
- Abstract: In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount. While a segment of current research leans heavily on large-scale pretrained visual-language models and often overlooks temporal dynamics, another delves deep into spatial-temporal relationships within videos but demands intricate object trajectory pre-extractions and sidelines dialog temporal dynamics. This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD), strategically designed to merge the strengths of both dominant approaches. It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions, filtering video content accordingly, and grounding responses in both video and dialog contexts. One standout feature of DTGVD is its heightened attention to chronological interplay. By recognizing and acting upon the dependencies between different dialog turns, it captures more nuanced conversational dynamics. To further bolster the alignment between video and dialog temporal dynamics, we've implemented a list-wise contrastive learning strategy. Within this framework, accurately grounded turn-clip pairings are designated as positive samples, while less precise pairings are categorized as negative. This refined classification is then funneled into our holistic end-to-end response generation mechanism. Evaluations using AVSD@DSTC-7 and AVSD@DSTC-8 datasets underscore the superiority of our methodology.
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for
Video-Grounded Dialog [10.290057801577662]
OLViT is a novel model for video dialog operating over a multi-modal attention-based dialog state tracker.
It maintains a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST)
arXiv Detail & Related papers (2024-02-20T17:00:59Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - $C^3$: Compositional Counterfactual Contrastive Learning for
Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose a novel approach of Compositional Counterfactual Contrastive Learning to develop contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z) - BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded
Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework by formulating sequence-to-grounded dialogue tasks as a sequence-to-grounded task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z) - Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences [107.0776836117313]
Given an un-trimmed video and a declarative/interrogative sentence, STVG aims to localize the-temporal tube of the object queried.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a declarative-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.