BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded
Dialogues
- URL: http://arxiv.org/abs/2010.10095v1
- Date: Tue, 20 Oct 2020 07:43:00 GMT
- Title: BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded
Dialogues
- Authors: Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi
- Abstract summary: We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
- Score: 95.8297116307127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-grounded dialogues are very challenging due to (i) the complexity of
videos which contain both spatial and temporal variations, and (ii) the
complexity of user utterances which query different segments and/or different
objects in videos over multiple dialogue turns. However, existing approaches to
video-grounded dialogues often focus on superficial temporal-level visual cues,
but neglect more fine-grained spatial signals from videos. To address this
drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a
vision-language neural framework for high-resolution queries in videos based on
textual cues. Specifically, our approach not only exploits both spatial and
temporal-level information, but also learns dynamic information diffusion
between the two feature spaces through spatial-to-temporal and
temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the
evolving semantics of user queries in the dialogue setting. The retrieved
visual cues are used as contextual information to construct relevant responses
to the users. Our empirical results and comprehensive qualitative analysis show
that BiST achieves competitive performance and generates reasonable responses
on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA
setting, and substantially outperform prior approaches on the TGIF-QA
benchmark.
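To make the bi-directional idea concrete, below is a minimal PyTorch sketch of text-guided attention that reasons in both directions (temporal-to-spatial and spatial-to-temporal) and fuses the two visual contexts. It is only an illustration of what the abstract describes, not the authors' released implementation; the module layout, feature dimensions, and fusion layer are assumptions.

```python
# Minimal sketch of text-guided bi-directional spatio-temporal attention.
# Not the authors' code; module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class BiDirectionalSpatioTemporalAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Direction 1: ground text in time, then diffuse into spatial regions.
        self.text_to_temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_to_spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Direction 2: ground text in space, then diffuse into frames.
        self.text_to_spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_to_temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text, temporal, spatial):
        # text:     (B, L, d) encoded query/dialogue tokens
        # temporal: (B, T, d) one feature per frame or segment
        # spatial:  (B, S, d) region features pooled over time
        t_ctx, _ = self.text_to_temporal(text, temporal, temporal)       # (B, L, d)
        ts_ctx, _ = self.temporal_to_spatial(t_ctx, spatial, spatial)    # (B, L, d)
        s_ctx, _ = self.text_to_spatial(text, spatial, spatial)          # (B, L, d)
        st_ctx, _ = self.spatial_to_temporal(s_ctx, temporal, temporal)  # (B, L, d)
        # Fuse the two reasoning directions into one visual context per text token.
        return self.fuse(torch.cat([ts_ctx, st_ctx], dim=-1))


if __name__ == "__main__":
    B, L, T, S, d = 2, 12, 16, 36, 512
    block = BiDirectionalSpatioTemporalAttention(d_model=d)
    out = block(torch.randn(B, L, d), torch.randn(B, T, d), torch.randn(B, S, d))
    print(out.shape)  # torch.Size([2, 12, 512])
```

The fused visual context can then serve as the conditioning input to a response decoder, which is how the retrieved cues would feed response generation in a dialogue setting.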
Related papers
- Grounding is All You Need? Dual Temporal Grounding for Video Dialog [48.3411605700214]
This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD)
It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions.
It also filters video content accordingly and grounds responses in both video and dialog contexts.
arXiv Detail & Related papers (2024-10-08T07:48:34Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Rethinking Multi-Modal Alignment in Video Question Answering from
Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models to improve video-grounded dialogue.
We propose a framework that formulates the video-grounded dialogue task as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities (a minimal sketch of this formulation appears after this list).
arXiv Detail & Related papers (2020-06-27T08:24:26Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of these features as well as their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
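As referenced in the pretrained-LM entry above, here is a minimal sketch of casting video-grounded dialogue as a single flat sequence fed to a pretrained GPT-2: video features are projected into the word-embedding space, prepended to the dialogue tokens, and the model is fine-tuned with a language-modeling loss. The GPT-2 backbone, the 2048-d segment features, and the label-masking scheme are assumptions for illustration, not that paper's released code.

```python
# Sketch: video-grounded dialogue as a sequence-to-sequence task on a pretrained LM.
# Backbone, feature dimensions, and masking choices are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Assumed pre-extracted video features, e.g. 16 segments of 2048-d features.
video_feats = torch.randn(1, 16, 2048)
video_proj = nn.Linear(2048, model.config.n_embd)        # map into GPT-2 embedding space

dialogue = "Q: what is the man holding? A: a cup of coffee"
token_ids = tokenizer(dialogue, return_tensors="pt").input_ids
token_emb = model.transformer.wte(token_ids)              # (1, L, n_embd)
video_emb = video_proj(video_feats)                       # (1, 16, n_embd)

# One flat input sequence: video segments followed by dialogue tokens.
inputs_embeds = torch.cat([video_emb, token_emb], dim=1)
# Labels: ignore the video positions (-100), predict the dialogue tokens.
labels = torch.cat(
    [torch.full((1, video_emb.size(1)), -100, dtype=torch.long), token_ids], dim=1
)

outputs = model(inputs_embeds=inputs_embeds, labels=labels)
outputs.loss.backward()                                   # fine-tune LM and projection jointly
print(float(outputs.loss))
```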
This list is automatically generated from the titles and abstracts of the papers in this site.