Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
- URL: http://arxiv.org/abs/2509.05604v1
- Date: Sat, 06 Sep 2025 05:37:31 GMT
- Title: Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
- Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn
- Abstract summary: Video summarization aims to select keyframes that are visually diverse and represent the whole story of a given video. We present VideoGraph, which formulates the objects and frames as nodes of the spatial and temporal graphs. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization.
- Score: 47.65036144170475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global dependencies between frames in a video through temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated via graph edges that represent the semantic relationships between the nodes. To prevent the edges from being determined solely by visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to carry semantic knowledge. In addition, we adopt a recursive strategy to refine the initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised settings. The code is available at https://github.com/park-jungin/videograph.
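As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch sketch of one language-conditioned, recursive graph refinement step over frame nodes. Everything here (tensor shapes, the linear fusion layer, dot-product edges, the number of refinement steps) is an assumption for illustration, not the authors' implementation; see the linked repository for the real code.

```python
# Illustrative sketch only: recursive refinement over a frame graph whose
# node features are fused with a language query embedding. Shapes, layer
# choices, and the refinement loop are assumptions, not the paper's code.
import torch
import torch.nn as nn


class RecursiveGraphStep(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)   # inject the query into each node
        self.update = nn.Linear(d, d)     # node update after aggregation
        self.classify = nn.Linear(d, 1)   # per-frame keyframe score

    def forward(self, nodes: torch.Tensor, query: torch.Tensor, steps: int = 3):
        # Condition every node on the language query so that edge weights
        # reflect semantic relatedness rather than raw visual similarity.
        q = query.expand(nodes.size(0), -1)
        h = self.fuse(torch.cat([nodes, q], dim=-1))
        for _ in range(steps):  # recursive refinement of the graph
            edges = torch.softmax(h @ h.t() / h.size(-1) ** 0.5, dim=-1)
            h = torch.relu(self.update(edges @ h)) + h  # aggregate + residual
        return torch.sigmoid(self.classify(h)).squeeze(-1)  # keyframe probs


# Toy usage: random features stand in for real frame embeddings and a query.
frames, query = torch.randn(20, 256), torch.randn(256)
scores = RecursiveGraphStep(256)(frames, query)
print(scores.shape)  # torch.Size([20])
```

The same pattern would apply to the object (spatial) graph; the paper additionally couples the spatial and temporal graphs, which this toy omits.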
Related papers
- Language-Guided Graph Representation Learning for Video Summarization [96.2763459348758]
We propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. Our method outperforms existing approaches across multiple benchmarks.
arXiv Detail & Related papers (2025-11-14T04:35:48Z) - VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and its sparsity ensures the model trains without hitting memory and compute bottlenecks.
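For intuition about why sparsity matters here, a toy construction of a window-limited frame graph is sketched below; the fixed temporal-window rule is a stand-in assumption, as the blurb does not say how VideoSAGE actually sparsifies its graph.

```python
# Toy sparse frame graph: connect each frame only to neighbors within a
# temporal window w (an assumed rule). Edge count grows linearly with video
# length, versus quadratically for a fully connected graph.
import torch

def sparse_frame_edges(num_frames: int, w: int = 4) -> torch.Tensor:
    src, dst = [], []
    for i in range(num_frames):
        for j in range(max(0, i - w), min(num_frames, i + w + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    return torch.tensor([src, dst])  # 2 x num_edges COO-style index list

print(sparse_frame_edges(100).shape[1])  # 780 edges vs. 9900 when dense
```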
arXiv Detail & Related papers (2024-04-14T15:49:02Z) - Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph, to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
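The signed-edge idea can be made concrete with a small sketch; the label source, the Gaussian temporal decay, and the sign convention below are all illustrative assumptions rather than the paper's actual weighting scheme.

```python
# Toy signed semantic edges: frames sharing an action label get positive
# weights, frames with different labels get negative weights, both decayed
# by temporal distance. The decay and sign scheme are assumptions.
import torch

def signed_edge_weights(labels: torch.Tensor, sigma: float = 8.0) -> torch.Tensor:
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same action class?
    t = torch.arange(len(labels), dtype=torch.float)
    decay = torch.exp(-((t.unsqueeze(0) - t.unsqueeze(1)) ** 2) / sigma ** 2)
    return torch.where(same, decay, -decay)  # positive vs. negative edges

print(signed_edge_weights(torch.tensor([0, 0, 1, 1])))
```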
arXiv Detail & Related papers (2022-09-13T00:01:23Z) - Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z) - VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z) - SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose graph modeling networks for video summarization, termed SumGraph, to represent a relation graph.
We achieve state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners.
arXiv Detail & Related papers (2020-07-17T08:11:30Z) - Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
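To make "fully connected graph over frames" concrete, here is a toy dense adjacency built from pairwise feature similarity; AGNN's actual attention-based edge function is more elaborate than this dot-product stand-in.

```python
# Toy fully connected frame graph: every frame pair gets an edge whose
# weight is a normalized feature similarity (a simplification of AGNN's
# attentive edges), followed by one message-passing round.
import torch

frames = torch.randn(16, 128)                   # 16 frames, 128-d features
sim = frames @ frames.t()                       # dense pairwise relations
adj = torch.softmax(sim / 128 ** 0.5, dim=-1)   # normalized edge weights
messages = adj @ frames                         # one aggregation round
print(messages.shape)                           # torch.Size([16, 128])
```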
arXiv Detail & Related papers (2020-01-19T10:45:27Z) - Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data [29.841574293529796]
We propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering complex structures of the video.
CB-GLNs represent video data as a graph, with nodes and edges corresponding to frames of the video and their dependencies respectively.
We evaluate the proposed method on two different video understanding tasks: video theme classification (YouTube-8M dataset) and video question answering (TVQA dataset).
arXiv Detail & Related papers (2020-01-17T10:09:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.