Video Captioning with Aggregated Features Based on Dual Graphs and Gated
Fusion
- URL: http://arxiv.org/abs/2308.06685v1
- Date: Sun, 13 Aug 2023 05:18:08 GMT
- Title: Video Captioning with Aggregated Features Based on Dual Graphs and Gated
Fusion
- Authors: Yutao Jin, Bin Liu, Jing Wang
- Abstract summary: Video captioning models aim to translate the content of videos into accurate natural language.
Existing methods often fail to generate sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
- Score: 6.096411752534632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning models aim to translate the content of videos into
accurate natural language. Because of the complex nature of object interactions
in a video, comprehensively understanding the spatio-temporal relations among
objects remains a challenging task, and existing methods often fail to generate
sufficient feature representations of video content. In this paper, we propose
a video captioning model based on dual graphs and gated fusion: we adopt two
types of graphs to generate feature representations of video content and use
gated fusion to further integrate these different levels of information. The
dual-graph model generates appearance features and motion features separately,
exploiting the content correlation across frames to produce varied features
from multiple perspectives. Dual-graph reasoning enhances the content
correlation in frame sequences to generate advanced semantic features, while
gated fusion aggregates the information in the multiple feature representations
for comprehensive video content understanding. Experiments on the widely used
MSVD and MSR-VTT datasets demonstrate the state-of-the-art performance of our
proposed approach.
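The pipeline described in the abstract, per-frame appearance and motion features refined by graph reasoning and then combined by a learned gate, can be illustrated with a minimal PyTorch sketch. The module names, similarity-based graph construction, and shapes below are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch: dual-graph reasoning over frame features plus gated fusion.
# Class names, graph construction, and dimensions are assumptions for
# illustration; the paper's actual design may differ.
import torch
import torch.nn as nn


class FrameGraphReasoning(nn.Module):
    """One round of graph reasoning over per-frame features.

    The adjacency is built from pairwise feature similarity, one common way
    to model content correlation between frames (an assumption here).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, dim)
        scores = feats @ feats.transpose(1, 2) / feats.size(-1) ** 0.5
        adj = torch.softmax(scores, dim=-1)                 # soft frame-to-frame graph
        return torch.relu(self.proj(adj @ feats)) + feats   # residual update


class GatedFusion(nn.Module):
    """Fuse two feature streams with a learned, per-dimension sigmoid gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([appearance, motion], dim=-1)))
        return g * appearance + (1.0 - g) * motion          # convex combination


# Usage with illustrative shapes: 4 videos, 26 sampled frames, 512-d features.
app_graph, mot_graph = FrameGraphReasoning(512), FrameGraphReasoning(512)
fusion = GatedFusion(512)

appearance = torch.randn(4, 26, 512)   # e.g. 2D-CNN frame features
motion = torch.randn(4, 26, 512)       # e.g. 3D-CNN clip features
fused = fusion(app_graph(appearance), mot_graph(motion))    # (4, 26, 512)
```

The fused representation would then feed a caption decoder; the gate lets the model weigh appearance against motion cues per dimension instead of simply concatenating the two streams.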
Related papers
- GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning [4.290482766926506]
Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video.
Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme.
Results demonstrate superior performance across benchmark datasets.
arXiv Detail & Related papers (2024-10-12T06:01:00Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
arXiv Detail & Related papers (2022-09-13T00:01:23Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while a video is presented as a frame sequence, its visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Exploration of Visual Features and their weighted-additive fusion for Video Captioning [0.7388859384645263]
Video captioning is a popular task that challenges models to describe events in videos using natural language.
In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context.
arXiv Detail & Related papers (2021-01-14T07:21:13Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.