Structured Co-reference Graph Attention for Video-grounded Dialogue
- URL: http://arxiv.org/abs/2103.13361v1
- Date: Wed, 24 Mar 2021 17:36:33 GMT
- Title: Structured Co-reference Graph Attention for Video-grounded Dialogue
- Authors: Junyeong Kim and Sunjae Yoon and Dahyun Kim and Chang D. Yoo
- Abstract summary: The Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video.
Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on two benchmarks.
- Score: 17.797726722637634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A video-grounded dialogue system referred to as the Structured Co-reference
Graph Attention (SCGA) is presented for decoding the answer sequence to a
question regarding a given video while keeping track of the dialogue context.
Although recent efforts have made great strides in improving the quality of
responses, performance is still far from satisfactory. The two main challenges
are: (1) how to deduce co-reference among multiple modalities and (2) how to
reason over the rich underlying semantic structure of video with complex
spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured
Co-reference Resolver that performs dereferencing by building a structured
graph over multiple modalities, and (2) a Spatio-temporal Video Reasoner that
captures local-to-global dynamics of video via gradually neighboring graph
attention. SCGA also makes use of a pointer network to dynamically replicate
parts of the question when decoding the answer sequence. The validity of the
proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, two
challenging video-grounded dialogue benchmarks, and on the TVQA dataset, a
large-scale videoQA benchmark. Our empirical results show that SCGA
outperforms other state-of-the-art dialogue systems on both benchmarks, while
extensive ablation studies and qualitative analyses reveal the performance
gain and improved interpretability.
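As a rough, illustrative sketch only (not the authors' implementation; the module names, tensor shapes, and the toy co-reference edge below are hypothetical), the following PyTorch snippet shows the two generic building blocks the abstract points to: masked graph attention, in which a node attends only to its neighbors in an explicitly constructed graph (the mechanism underlying both the Structured Co-reference Resolver and the Spatio-temporal Video Reasoner), and a pointer-style copy distribution over question tokens, as used when decoding the answer sequence.

```python
# Minimal sketch (illustrative, not the SCGA codebase): masked graph attention
# over a multi-modal node set plus a pointer-style copy distribution.
import torch
import torch.nn as nn


class MaskedGraphAttention(nn.Module):
    """Single-head graph attention: node i attends only to nodes j with adj[i, j] = 1."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) node features; adj: (N, N) binary adjacency (graph edges)
        scores = self.query(nodes) @ self.key(nodes).t() / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))  # keep only graph edges
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.value(nodes)                        # updated node features


def pointer_copy_scores(decoder_state: torch.Tensor,
                        question_tokens: torch.Tensor) -> torch.Tensor:
    """Pointer-network-style distribution over question positions for copying a token."""
    # decoder_state: (dim,), question_tokens: (L, dim) -> (L,) copy probabilities
    return torch.softmax(question_tokens @ decoder_state, dim=-1)


if __name__ == "__main__":
    dim, n_video, n_text = 64, 6, 4
    # Toy graph: video-region nodes and dialogue-token nodes in one node set, with a
    # hypothetical co-reference edge linking a text mention to the region it refers to.
    nodes = torch.randn(n_video + n_text, dim)
    adj = torch.eye(n_video + n_text)
    adj[n_video + 0, 2] = adj[2, n_video + 0] = 1.0  # e.g. "it" <-> region 2
    gat = MaskedGraphAttention(dim)
    updated = gat(nodes, adj)
    copy_probs = pointer_copy_scores(updated[n_video + 0], torch.randn(10, dim))
    print(updated.shape, copy_probs.shape)  # torch.Size([10, 64]) torch.Size([10])
```

Presumably, the full model builds separate graphs for co-reference edges across modalities and for gradually widened spatio-temporal neighborhoods, and mixes the copy distribution with a vocabulary distribution during answer generation; the sketch above only isolates the shared attention-and-copy mechanics.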
Related papers
- Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.
Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that although video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
- Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
The Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations of the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information such as subtitles is also presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
Our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)