GraphVid: It Only Takes a Few Nodes to Understand a Video
- URL: http://arxiv.org/abs/2207.01375v1
- Date: Mon, 4 Jul 2022 12:52:54 GMT
- Title: GraphVid: It Only Takes a Few Nodes to Understand a Video
- Authors: Eitan Kosman and Dotan Di Castro
- Abstract summary: We propose a concise representation of videos that encodes perceptually meaningful features into graphs.
We construct superpixel-based graph representations of videos by considering superpixels as graph nodes.
We leverage Graph Convolutional Networks to process this representation and predict the desired output.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a concise representation of videos that encodes perceptually
meaningful features into graphs. With this representation, we aim to leverage
the large amount of redundancy in videos and save computation. First, we
construct superpixel-based graph representations of videos by considering
superpixels as graph nodes and create spatial and temporal connections between
adjacent superpixels. Then, we leverage Graph Convolutional Networks to process
this representation and predict the desired output. As a result, we are able to
train models with far fewer parameters, which translates into shorter training
times and reduced computational resource requirements. A comprehensive
experimental study on the publicly available datasets Kinetics-400 and Charades
shows that the proposed method is highly cost-effective and uses limited
commodity hardware during training and inference. It reduces the computational
requirements 10-fold while achieving results that are comparable to
state-of-the-art methods. We believe that the proposed approach is a promising
direction that could open the door to solving video understanding more
efficiently and enable more resource-limited users to thrive in this research
field.
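The abstract's pipeline can be made concrete with a short sketch. The following is a minimal illustration, not the authors' code: it assumes SLIC superpixels, mean-color node features, 4-connectivity for spatial edges, and a plain dense GCN layer; temporal edges between superpixels of consecutive frames would be added analogously and are omitted for brevity.

```python
# Minimal sketch of the superpixel-graph idea (illustrative assumptions:
# SLIC segmentation, mean-color node features, dense-adjacency GCN).
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

def frame_to_nodes(frame, n_segments=50):
    """Segment one RGB frame (H, W, 3); return labels and mean-color node features."""
    labels = slic(frame, n_segments=n_segments, compactness=10.0, start_label=0)
    feats = np.stack([frame[labels == i].mean(axis=0) for i in np.unique(labels)])
    return labels, torch.tensor(feats, dtype=torch.float32)

def spatial_edges(labels):
    """Undirected edges between superpixels that touch horizontally or vertically."""
    edges = set()
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return edges

class DenseGCNLayer(nn.Module):
    """One graph convolution over a dense, self-looped, row-normalized adjacency."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):                    # x: (N, d_in), adj: (N, N)
        adj = adj + torch.eye(adj.size(0))        # add self-loops
        adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize
        return torch.relu(self.lin(adj @ x))
```

A frame of hundreds of thousands of pixels collapses to roughly n_segments nodes, so the downstream network can be orders of magnitude smaller than a pixel-level CNN, which is the intuition behind the cost reduction reported above.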
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, while its sparsity keeps training within memory and compute budgets (a minimal construction sketch follows this entry).
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
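As a rough illustration of the sparsity idea (the window size and connection rule are assumptions, not the paper's exact construction), frames can be linked only to temporal neighbors so the edge count grows linearly rather than quadratically:

```python
# Hypothetical sparse frame-graph: each frame connects only to frames within
# a fixed temporal window, so |E| is O(n_frames * window), not O(n_frames^2).
import torch

def window_frame_graph(n_frames: int, window: int = 4) -> torch.Tensor:
    """Edge list of shape (2, E) for an undirected windowed frame graph."""
    src, dst = [], []
    for i in range(n_frames):
        for j in range(i + 1, min(i + window + 1, n_frames)):
            src += [i, j]   # add both directions for an undirected graph
            dst += [j, i]
    return torch.tensor([src, dst])

edges = window_frame_graph(n_frames=120, window=4)
print(edges.shape)  # (2, E) with E growing linearly in the number of frames
```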
- Deep Prompt Tuning for Graph Transformers [55.2480439325792]
Fine-tuning is resource-intensive and requires storing multiple copies of large models.
We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning.
By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies (see the sketch after this entry).
arXiv Detail & Related papers (2023-09-18T20:12:17Z)
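The freezing-plus-added-tokens recipe is easy to picture; here is a hedged sketch (layer sizes, token count, and the encoder choice are illustrative, not the paper's architecture):

```python
# Sketch of prompt tuning for a graph transformer: the backbone is frozen and
# only the learnable prompt tokens, prepended to the node-token sequence,
# receive gradients.
import torch
import torch.nn as nn

class PromptTunedGraphTransformer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_prompts=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False                  # freeze pre-trained weights
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, node_tokens):                  # (batch, n_nodes, d_model)
        prompts = self.prompts.expand(node_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, node_tokens], dim=1))
```

Only model.prompts would go to the optimizer (e.g. torch.optim.Adam([model.prompts])), so one shared backbone serves many tasks with a tiny per-task state.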
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods (a schematic of the two-stage recipe follows this entry).
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
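A schematic of that two-stage recipe, with all module shapes chosen for illustration only (the paper's hierarchical, disentangled latents are more elaborate):

```python
# Stage 1 trains a frame autoencoder; stage 2 freezes it and trains a
# predictor purely in latent space, which is what saves memory.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 128))  # frame -> latent
dec = nn.Linear(128, 64 * 64 * 3)                               # latent -> frame
predictor = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

def stage1_loss(frames):                  # frames: (batch, 3, 64, 64)
    z = enc(frames)
    return F.mse_loss(dec(z), frames.flatten(1))

def stage2_loss(frame_seq):               # frame_seq: (batch, time, 3, 64, 64)
    with torch.no_grad():                 # encoder is frozen in stage 2
        z = torch.stack([enc(f) for f in frame_seq.unbind(1)], dim=1)
    pred, _ = predictor(z[:, :-1])        # predict each next latent from the prefix
    return F.mse_loss(pred, z[:, 1:])
```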
- PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream approach is to split a raw video into clips, leading to incomplete temporal information flow.
Inspired by natural language processing techniques for long sentences, we propose to treat videos as serial fragments satisfying the Markov property (a minimal sketch follows this entry).
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)
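One plausible reading of the Markov-fragment idea (the recurrent-state design below is an assumption, not the paper's exact mechanism) is to process clips in order while carrying only a compact, detached state between them:

```python
# Fragments are processed serially; gradients stop at fragment boundaries,
# so memory stays constant regardless of video length.
import torch
import torch.nn as nn

class ProgressiveClipModel(nn.Module):
    def __init__(self, d_feat=512, d_state=256, n_classes=400):
        super().__init__()
        self.clip_encoder = nn.Linear(d_feat, d_state)  # stand-in for a clip backbone
        self.state_cell = nn.GRUCell(d_state, d_state)
        self.head = nn.Linear(d_state, n_classes)

    def forward(self, clips):             # clips: (n_clips, batch, d_feat)
        state = clips.new_zeros(clips.size(1), self.state_cell.hidden_size)
        logits = []
        for clip in clips:
            state = self.state_cell(torch.relu(self.clip_encoder(clip)), state)
            logits.append(self.head(state))   # per-fragment prediction
            state = state.detach()        # Markov cut: no backprop across fragments
        return logits
```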
- Fast Interactive Video Object Segmentation with Graph Neural Networks [0.0]
We present a graph neural network-based approach to interactive video object segmentation.
Our network operates on superpixel graphs, which allows us to reduce the dimensionality of the problem by several orders of magnitude.
arXiv Detail & Related papers (2021-03-05T17:37:12Z)
- Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing networks (a generic squeeze-and-rescale sketch follows).
arXiv Detail & Related papers (2020-11-06T12:17:01Z)
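The squeeze step maps naturally to a pooled channel vector; the sketch below shows the generic squeeze-reason-rescale pattern (the paper's reasoning module differs in detail):

```python
# Pool the spatial map into one channel-wise vector, "reason" over it with a
# small MLP, then modulate the input; no spatial message passing is needed.
import torch
import torch.nn as nn

class SqueezeReasoningBlock(nn.Module):
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        self.reason = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, C, H, W)
        v = x.mean(dim=(2, 3))                # squeeze: spatial map -> channel vector
        w = self.reason(v)                    # reason in the compact vector space
        return x * w.unsqueeze(-1).unsqueeze(-1)  # expand back and rescale input
```

Because the block keeps input and output shapes identical, it can be dropped in after any convolutional stage, matching the plug-in claim above.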
- About Graph Degeneracy, Representation Learning and Scalability [2.029783382155471]
We present two techniques taking advantage of the K-Core Decomposition to reduce the time and memory consumption of walk-based Graph Representation Learning algorithms.
We evaluate the performance, in terms of embedding quality and computational resources, of the proposed techniques on several academic datasets (a small k-core illustration follows).
arXiv Detail & Related papers (2020-09-04T09:39:43Z)
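A small illustration of the general recipe (the embedding step itself is left abstract; nx.k_core is standard networkx, not the paper's code):

```python
# Restrict walk-based embedding to a dense k-core: iteratively prune nodes of
# degree < k, then run walks only on the surviving subgraph.
import networkx as nx

def core_reduced_graph(g: nx.Graph, k: int = 2) -> nx.Graph:
    g = g.copy()
    g.remove_edges_from(nx.selfloop_edges(g))  # k_core requires no self-loops
    return nx.k_core(g, k=k)

g = nx.karate_club_graph()
core = core_reduced_graph(g, k=3)
print(len(g), "->", len(core), "nodes to embed")  # walks now cover fewer nodes
```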
- SIGN: Scalable Inception Graph Neural Networks [4.5158585619109495]
We propose a new, efficient and scalable graph deep learning architecture that sidesteps the need for graph sampling.
Our architecture allows using different local graph operators to best suit the task at hand (a condensed sketch of this precomputation follows this entry).
We obtain state-of-the-art results on ogbn-papers100M, the largest public graph dataset, with over 110 million nodes and 1.5 billion edges.
arXiv Detail & Related papers (2020-04-23T14:46:10Z)
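A condensed sketch of the sampling-free idea as summarized above (a dense adjacency and its powers stand in for the paper's family of local operators):

```python
# Operator features [X, AX, A^2X, ...] are computed once as preprocessing,
# so training reduces to plain mini-batch MLP updates with no graph sampling.
import torch
import torch.nn as nn

def precompute_operator_features(adj, x, n_hops=2):
    """Concatenate [X, AX, ..., A^n X] for a row-normalized dense adjacency."""
    adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)
    feats, cur = [x], x
    for _ in range(n_hops):
        cur = adj @ cur                  # one more hop of neighborhood averaging
        feats.append(cur)
    return torch.cat(feats, dim=1)       # (n_nodes, (n_hops + 1) * d)

class SignStyleClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, z):                # z: rows of precomputed operator features
        return self.net(z)
```

Because the operator features are fixed after preprocessing, nodes can be shuffled into ordinary mini-batches, which is how graph sampling is sidestepped at billion-edge scale.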