GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
- URL: http://arxiv.org/abs/2410.09377v1
- Date: Sat, 12 Oct 2024 06:01:00 GMT
- Title: GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
- Authors: Eileen Wang, Caren Han, Josiah Poon
- Abstract summary: Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video.
Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme.
Results demonstrate superior performance across benchmark datasets.
- Score: 4.290482766926506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising multimodal signals inherent in videos and addressing the long-tail distribution of words. The paper introduces a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input for a transformer network with a shared encoder-decoder architecture. We also introduce a node selection module to enhance decoding efficiency by selecting the most relevant nodes from the graphs. Our results demonstrate superior performance across benchmark datasets.
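To make the framework's data flow concrete, here is a minimal PyTorch sketch of the node-selection idea described in the abstract: each graph is assumed to arrive as a matrix of node embeddings, a learned scorer keeps the top-k nodes per graph, and the selected nodes condition a shared transformer encoder-decoder that generates the caption tokens. All module names, dimensions, and the scoring function are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class NodeSelector(nn.Module):
    """Scores graph nodes and keeps the top-k most relevant ones.

    Illustrative stand-in for the paper's node selection module; the
    linear scorer and hard top-k are assumptions, not the authors' design.
    """

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.k = k

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, dim) node embeddings from one graph
        scores = self.scorer(nodes).squeeze(-1)                  # (num_nodes,)
        top = torch.topk(scores, k=min(self.k, nodes.size(0))).indices
        return nodes[top]                                        # (<=k, dim)


class DualGraphCaptioner(nn.Module):
    """Sketch of a shared encoder-decoder conditioned on selected graph nodes."""

    def __init__(self, dim: int = 256, vocab_size: int = 10000, k: int = 32):
        super().__init__()
        self.select_video = NodeSelector(dim, k)   # video-specific temporal graph
        self.select_theme = NodeSelector(dim, k)   # theme graph
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, video_nodes, theme_nodes, caption_tokens):
        # Concatenate the selected nodes from both graphs as the encoder input.
        src = torch.cat([self.select_video(video_nodes),
                         self.select_theme(theme_nodes)], dim=0).unsqueeze(0)
        tgt = self.embed(caption_tokens).unsqueeze(0)            # (1, T, dim)
        hidden = self.transformer(src, tgt)                      # (1, T, dim)
        return self.out(hidden)                                  # (1, T, vocab)
```

A call such as model(video_nodes, theme_nodes, tokens), with node tensors of shape (N, 256) and a 1-D tensor of token ids, returns per-token vocabulary logits; a full implementation would additionally need graph construction, attention masks, and a differentiable or score-weighted selection so the scorer receives gradients.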
Related papers
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, while its sparsity ensures that the model trains without hitting memory and compute bottlenecks.
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
- MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation [25.273719615694958]
We propose a novel approach named MSG-BART, which enhances the integration of video information.
Specifically, we integrate global and local scene graphs into the encoder and decoder, respectively.
Extensive experiments are conducted on three video-grounded dialogue benchmarks, which show the significant superiority of MSG-BART.
arXiv Detail & Related papers (2023-09-26T04:23:23Z)
- Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
Video captioning models aim to translate the content of videos into accurate natural language.
Existing methods often fail to generate sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
- Multimodal Graph Transformer for Multimodal Question Answering [9.292566397511763]
We propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities.
We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information.
We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
arXiv Detail & Related papers (2023-04-30T21:22:35Z)
- Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
The Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction (a generic sketch of this pooling operation is given after this list).
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information changing across video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
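As mentioned in the VSLAN entry above, low-rank bilinear pooling fuses two feature vectors by projecting each into a shared low-rank space and multiplying the projections element-wise before a final output projection. The following is a generic PyTorch sketch of that operation; the rank, the tanh activation, and the dimensions are common choices assumed here, not VSLAN's exact configuration.

```python
import torch
import torch.nn as nn


class LowRankBilinearPooling(nn.Module):
    """Generic low-rank bilinear pooling: z = P^T (tanh(U^T x) * tanh(V^T y))."""

    def __init__(self, dim_x: int, dim_y: int, rank: int = 512, dim_out: int = 256):
        super().__init__()
        self.proj_x = nn.Linear(dim_x, rank, bias=False)       # U
        self.proj_y = nn.Linear(dim_y, rank, bias=False)       # V
        self.proj_out = nn.Linear(rank, dim_out, bias=False)   # P

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Element-wise (Hadamard) product of the two low-rank projections.
        joint = torch.tanh(self.proj_x(x)) * torch.tanh(self.proj_y(y))
        return self.proj_out(joint)


# Example: fuse a batch of 2048-d visual features with 300-d word features.
pool = LowRankBilinearPooling(dim_x=2048, dim_y=300)
z = pool(torch.randn(4, 2048), torch.randn(4, 300))            # -> shape (4, 256)
```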