MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder
Language Model for Video-grounded Dialogue Generation
- URL: http://arxiv.org/abs/2311.12820v1
- Date: Tue, 26 Sep 2023 04:23:23 GMT
- Title: MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder
Language Model for Video-grounded Dialogue Generation
- Authors: Hongcheng Liu, Zhe Chen, Hui Li, Pingjie Wang, Yanfeng Wang, Yu Wang
- Abstract summary: We propose a novel approach named MSG-BART, which enhances the integration of video information.
Specifically, we integrate global and local scene graphs into the encoder and decoder, respectively.
Extensive experiments are conducted on three video-grounded dialogue benchmarks, which show the significant superiority of MSG-BART.
- Score: 25.273719615694958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating dialogue grounded in videos requires a high level of understanding
and reasoning about the visual scenes in the videos. However, existing large
visual-language models are not effective due to their latent features and
decoder-only structure, especially with respect to spatio-temporal relationship
reasoning. In this paper, we propose a novel approach named MSG-BART, which
enhances the integration of video information by incorporating a
multi-granularity spatio-temporal scene graph into an encoder-decoder
pre-trained language model. Specifically, we integrate the global and local
scene graphs into the encoder and decoder, respectively, to improve both overall
perception and target reasoning capability. To further improve the information
selection capability, we propose a multi-pointer network to facilitate
selection between text and video. Extensive experiments are conducted on three
video-grounded dialogue benchmarks, which show the significant superiority of
the proposed MSG-BART compared to a range of state-of-the-art approaches.
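The abstract names a multi-pointer network for selecting between text and video sources but does not spell out its form here. Purely as a hypothetical illustration (module and tensor names such as MultiPointerHead, text_mem, and graph_mem are my own, not the paper's), a pointer-style head could mix a vocabulary distribution with copy distributions over text tokens and scene-graph nodes:

```python
# Hypothetical sketch of a pointer-style selector between text and video/graph
# memories, loosely inspired by the "multi-pointer network" in the abstract.
# Names, shapes, and the gating scheme are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiPointerHead(nn.Module):
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden, vocab)   # generate from the vocabulary
        self.gate = nn.Linear(hidden, 3)             # choose: vocab / text / graph

    def forward(self, dec_state, text_mem, graph_mem, text_ids, graph_ids):
        # dec_state: (B, H); text_mem: (B, Lt, H); graph_mem: (B, Lg, H)
        # text_ids / graph_ids: token ids aligned with the memories, (B, Lt) / (B, Lg)
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)                    # (B, V)

        # Pointer distributions: dot-product attention over each memory.
        att_text = F.softmax(torch.einsum("bh,blh->bl", dec_state, text_mem), dim=-1)
        att_graph = F.softmax(torch.einsum("bh,blh->bl", dec_state, graph_mem), dim=-1)

        # Scatter the pointer probabilities back onto the vocabulary.
        p_text = torch.zeros_like(p_vocab).scatter_add(-1, text_ids, att_text)
        p_graph = torch.zeros_like(p_vocab).scatter_add(-1, graph_ids, att_graph)

        # A soft gate decides how much to copy from text vs. graph vs. generate.
        w = F.softmax(self.gate(dec_state), dim=-1)                                # (B, 3)
        return w[:, :1] * p_vocab + w[:, 1:2] * p_text + w[:, 2:3] * p_graph


head = MultiPointerHead(hidden=768, vocab=50265)
out = head(
    torch.randn(2, 768),                                   # decoder state
    torch.randn(2, 10, 768), torch.randn(2, 6, 768),       # text / graph memories
    torch.randint(0, 50265, (2, 10)), torch.randint(0, 50265, (2, 6)),
)
print(out.shape)                                           # torch.Size([2, 50265])
```

In the actual model this selection would presumably happen at every decoding step inside the BART decoder; the sketch only shows the final mixing of distributions.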
Related papers
- GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning [4.290482766926506]
Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise the key events within a video.
Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme.
Results demonstrate superior performance across benchmark datasets.
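The GEM-VPC summary above mentions a 'video-specific' temporal graph and a 'theme graph' without detailing their construction. Purely as an assumed illustration (the event annotations, the edge rules, and the use of networkx are my choices, not the paper's), such graphs might be assembled as follows:

```python
# Illustrative sketch only: building a temporal event graph and a theme
# word-correlation graph with networkx. Inputs and edge rules are assumptions.
import itertools
import networkx as nx

# Assumed annotations: (event_id, start_sec, end_sec, caption) tuples.
events = [
    ("e1", 0.0, 4.2, "a chef chops onions"),
    ("e2", 3.8, 9.0, "the chef fries the onions in a pan"),
    ("e3", 9.5, 14.0, "the dish is plated and served"),
]

# "Video-specific" temporal graph: events are nodes, edges follow time order,
# labelled "overlaps" when the two intervals intersect.
temporal = nx.DiGraph()
temporal.add_nodes_from(e[0] for e in events)
for (a, sa, ea, _), (b, sb, eb, _) in itertools.combinations(events, 2):
    temporal.add_edge(a, b, relation="overlaps" if sb < ea else "before")

# "Theme" graph: words co-occurring in the same caption are connected.
theme = nx.Graph()
for _, _, _, caption in events:
    words = set(caption.split())
    theme.add_edges_from(itertools.combinations(sorted(words), 2))

print(temporal.number_of_edges(), theme.number_of_edges())
```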
arXiv Detail & Related papers (2024-10-12T06:01:00Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
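The CLIP-score-guided sampling in VaQuitA is only named above. A hedged sketch of the general idea is to score frames against the query text with an off-the-shelf CLIP model and keep the top-k frames instead of uniformly spaced ones; the checkpoint and the top-k rule below are assumptions, not the paper's exact procedure.

```python
# Hedged sketch: select frames by CLIP image-text similarity instead of
# uniform sampling. Model choice and top-k rule are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def select_frames(frames: list[Image.Image], query: str, k: int = 8) -> list[int]:
    """Return indices of the k frames most similar to the text query."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (num_frames, 1) similarity of each frame to the query.
    scores = out.logits_per_image.squeeze(-1)
    k = min(k, len(frames))
    return torch.topk(scores, k).indices.sort().values.tolist()
```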
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
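Video-Teller's use of frozen pretrained vision and language modules follows a common recipe: disable gradients on the backbones and train only the fusion parameters. The minimal pattern below illustrates that recipe in generic PyTorch; every module here is a placeholder, not a Video-Teller component.

```python
# Generic "frozen backbone + trainable fusion" pattern; all modules are
# placeholders rather than Video-Teller internals.
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients so the pretrained weights stay fixed during training."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


# Placeholder backbones; real ones would be pretrained vision/language models.
vision_encoder = freeze(nn.Linear(2048, 768))
language_model = freeze(nn.Embedding(50000, 768))
fusion = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)

# Only the fusion module's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in fusion.parameters() if p.requires_grad), lr=1e-4
)
```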
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity.
Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce rich semantic vocabulary to obtain a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur optimal captioning behavior.
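The 'visual representations from different video ranges' in GL-RG are described only at a high level. One simple, assumed way to obtain multi-range features is to pool frame features over windows at several temporal scales alongside a clip-level average; the window sizes and mean pooling below are illustrative choices, not the paper's encoder.

```python
# Illustrative multi-range pooling of frame features (not GL-RG's encoder):
# local windows at several temporal scales plus a global average.
import torch


def multi_range_features(frame_feats: torch.Tensor, windows=(4, 8, 16)) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features; returns (N, D) pooled segments."""
    pooled = [frame_feats.mean(dim=0, keepdim=True)]               # global range
    for w in windows:
        # Split the sequence into chunks of length w and average each chunk.
        chunks = frame_feats.split(w, dim=0)
        pooled.append(torch.stack([c.mean(dim=0) for c in chunks]))
    return torch.cat(pooled, dim=0)


feats = torch.randn(64, 512)                 # e.g. 64 frames, 512-d features
print(multi_range_features(feats).shape)     # (1 + 16 + 8 + 4, 512)
```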
arXiv Detail & Related papers (2022-05-22T02:00:09Z) - Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z) - Exploring Explicit and Implicit Visual Relationships for Image
Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
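The gated graph convolution over object pairs is only named above. The layer below sketches the general idea of gated neighbor aggregation in PyTorch, with a learned per-edge gate deciding how much of each neighbor's message a region keeps; it is a generic sketch, not the paper's exact Gated GCN.

```python
# Minimal gated neighbor aggregation over an object graph (generic sketch,
# not the paper's exact Gated GCN formulation).
import torch
import torch.nn as nn


class GatedGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)           # transform neighbor features
        self.gate = nn.Linear(2 * dim, dim)      # per-edge gate from (node, neighbor)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, D) region features; adj: (N, N) 0/1 adjacency over object pairs.
        n = x.size(0)
        src = x.unsqueeze(0).expand(n, n, -1)    # neighbor features per edge
        dst = x.unsqueeze(1).expand(n, n, -1)    # receiving node per edge
        gate = torch.sigmoid(self.gate(torch.cat([dst, src], dim=-1)))  # (N, N, D)
        messages = gate * self.msg(src) * adj.unsqueeze(-1)             # mask non-edges
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)                 # avoid div by 0
        return torch.relu(x + messages.sum(dim=1) / deg)


layer = GatedGraphLayer(256)
x = torch.randn(5, 256)                          # 5 detected object regions
adj = (torch.rand(5, 5) > 0.5).float()           # assumed semantic-graph edges
print(layer(x, adj).shape)                       # torch.Size([5, 256])
```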
arXiv Detail & Related papers (2021-05-06T01:47:51Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
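The sequence-to-sequence formulation in the last entry amounts to flattening video features and dialogue history into one input sequence for a pretrained generation model. A hedged sketch of the input construction is shown below using BART for convenience (the cited paper builds on GPT-2, and the projection layer, feature shapes, and model choice here are assumptions).

```python
# Hedged sketch of the sequence-to-sequence formulation: video features are
# projected into the language model's embedding space and prepended to the
# embedded dialogue history. BART is used for illustration; the exact wiring
# is an assumption, not the cited paper's GPT-2 setup.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

video_feats = torch.randn(1, 20, 2048)                   # assumed clip features
proj = nn.Linear(2048, model.config.d_model)             # map to the LM hidden size

history = "Q: what is the man doing? A: he is cooking. Q: what happens next?"
text_ids = tokenizer(history, return_tensors="pt").input_ids
text_embeds = model.get_input_embeddings()(text_ids)     # (1, Lt, d_model)

# One flat input sequence: [video embeddings ; dialogue-history embeddings]
inputs_embeds = torch.cat([proj(video_feats), text_embeds], dim=1)

labels = tokenizer("he plates the food", return_tensors="pt").input_ids
loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
```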