GraphVid: It Only Takes a Few Nodes to Understand a Video
- URL: http://arxiv.org/abs/2207.01375v1
- Date: Mon, 4 Jul 2022 12:52:54 GMT
- Title: GraphVid: It Only Takes a Few Nodes to Understand a Video
- Authors: Eitan Kosman and Dotan Di Castro
- Abstract summary: We propose a concise representation of videos that encodes perceptually meaningful features into graphs.
We construct superpixel-based graph representations of videos by considering superpixels as graph nodes.
We leverage Graph Convolutional Networks to process this representation and predict the desired output.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a concise representation of videos that encodes perceptually
meaningful features into graphs. With this representation, we aim to leverage
the large amount of redundancy in videos and save computation. First, we
construct superpixel-based graph representations of videos by considering
superpixels as graph nodes and create spatial and temporal connections between
adjacent superpixels. Then, we leverage Graph Convolutional Networks to process
this representation and predict the desired output. As a result, we are able to
train models with far fewer parameters, which translates into shorter training
times and reduced computational resource requirements. A comprehensive
experimental study on the publicly available datasets Kinetics-400 and Charades
shows that the proposed method is highly cost-effective and uses limited
commodity hardware during training and inference. It reduces the computational
requirements 10-fold while achieving results that are comparable to
state-of-the-art methods. We believe that the proposed approach is a promising
direction that could open the door to solving video understanding more
efficiently and enable more resource-limited users to thrive in this research
field.
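The abstract's pipeline can be made concrete with a short sketch. The following is a minimal illustration, not the authors' code: it assumes SLIC superpixels, mean-color node features, 4-connectivity for spatial edges, and a plain dense GCN layer; temporal edges between superpixels of consecutive frames would be added analogously and are omitted for brevity.

```python
# Minimal sketch of the superpixel-graph idea (illustrative assumptions:
# SLIC segmentation, mean-color node features, dense-adjacency GCN).
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

def frame_to_nodes(frame, n_segments=50):
    """Segment one RGB frame (H, W, 3); return labels and mean-color node features."""
    labels = slic(frame, n_segments=n_segments, compactness=10.0, start_label=0)
    feats = np.stack([frame[labels == i].mean(axis=0) for i in np.unique(labels)])
    return labels, torch.tensor(feats, dtype=torch.float32)

def spatial_edges(labels):
    """Undirected edges between superpixels that touch horizontally or vertically."""
    edges = set()
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return edges

class DenseGCNLayer(nn.Module):
    """One graph convolution over a dense, self-looped, row-normalized adjacency."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):                    # x: (N, d_in), adj: (N, N)
        adj = adj + torch.eye(adj.size(0))        # add self-loops
        adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize
        return torch.relu(self.lin(adj @ x))
```

A frame of hundreds of thousands of pixels collapses to roughly n_segments nodes, so the downstream network can be orders of magnitude smaller than a pixel-level CNN, which is the intuition behind the cost reduction reported above.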
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, while its sparsity keeps training within memory and compute budgets (a minimal construction sketch follows this entry).
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
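As a rough illustration of the sparsity idea (the window size and connection rule are assumptions, not the paper's exact construction), frames can be linked only to temporal neighbors so the edge count grows linearly rather than quadratically:

```python
# Hypothetical sparse frame-graph: each frame connects only to frames within
# a fixed temporal window, so |E| is O(n_frames * window), not O(n_frames^2).
import torch

def window_frame_graph(n_frames: int, window: int = 4) -> torch.Tensor:
    """Edge list of shape (2, E) for an undirected windowed frame graph."""
    src, dst = [], []
    for i in range(n_frames):
        for j in range(i + 1, min(i + window + 1, n_frames)):
            src += [i, j]   # add both directions for an undirected graph
            dst += [j, i]
    return torch.tensor([src, dst])

edges = window_frame_graph(n_frames=120, window=4)
print(edges.shape)  # (2, E) with E growing linearly in the number of frames
```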
- Deep Prompt Tuning for Graph Transformers [55.2480439325792]
Fine-tuning is resource-intensive and requires storing multiple copies of large models.
We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning.
By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies (see the sketch after this entry).
arXiv Detail & Related papers (2023-09-18T20:12:17Z)
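The freezing-plus-added-tokens recipe is easy to picture; here is a hedged sketch (layer sizes, token count, and the encoder choice are illustrative, not the paper's architecture):

```python
# Sketch of prompt tuning for a graph transformer: the backbone is frozen and
# only the learnable prompt tokens, prepended to the node-token sequence,
# receive gradients.
import torch
import torch.nn as nn

class PromptTunedGraphTransformer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_prompts=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False                  # freeze pre-trained weights
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, node_tokens):                  # (batch, n_nodes, d_model)
        prompts = self.prompts.expand(node_tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, node_tokens], dim=1))
```

Only model.prompts would go to the optimizer (e.g. torch.optim.Adam([model.prompts])), so one shared backbone serves many tasks with a tiny per-task state.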
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods (a schematic of the two-stage recipe follows this entry).
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
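A schematic of that two-stage recipe, with all module shapes chosen for illustration only (the paper's hierarchical, disentangled latents are more elaborate):

```python
# Stage 1 trains a frame autoencoder; stage 2 freezes it and trains a
# predictor purely in latent space, which is what saves memory.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 128))  # frame -> latent
dec = nn.Linear(128, 64 * 64 * 3)                               # latent -> frame
predictor = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

def stage1_loss(frames):                  # frames: (batch, 3, 64, 64)
    z = enc(frames)
    return F.mse_loss(dec(z), frames.flatten(1))

def stage2_loss(frame_seq):               # frame_seq: (batch, time, 3, 64, 64)
    with torch.no_grad():                 # encoder is frozen in stage 2
        z = torch.stack([enc(f) for f in frame_seq.unbind(1)], dim=1)
    pred, _ = predictor(z[:, :-1])        # predict each next latent from the prefix
    return F.mse_loss(pred, z[:, 1:])
```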
- PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream approach is to split a raw video into clips, leading to incomplete temporal information flow.
Inspired by natural language processing techniques for long sentences, we propose to treat videos as serial fragments satisfying the Markov property (a minimal sketch follows this entry).
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)
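One plausible reading of the Markov-fragment idea (the recurrent-state design below is an assumption, not the paper's exact mechanism) is to process clips in order while carrying only a compact, detached state between them:

```python
# Fragments are processed serially; gradients stop at fragment boundaries,
# so memory stays constant regardless of video length.
import torch
import torch.nn as nn

class ProgressiveClipModel(nn.Module):
    def __init__(self, d_feat=512, d_state=256, n_classes=400):
        super().__init__()
        self.clip_encoder = nn.Linear(d_feat, d_state)  # stand-in for a clip backbone
        self.state_cell = nn.GRUCell(d_state, d_state)
        self.head = nn.Linear(d_state, n_classes)

    def forward(self, clips):             # clips: (n_clips, batch, d_feat)
        state = clips.new_zeros(clips.size(1), self.state_cell.hidden_size)
        logits = []
        for clip in clips:
            state = self.state_cell(torch.relu(self.clip_encoder(clip)), state)
            logits.append(self.head(state))   # per-fragment prediction
            state = state.detach()        # Markov cut: no backprop across fragments
        return logits
```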
- Fast Interactive Video Object Segmentation with Graph Neural Networks [0.0]
We present a graph neural network-based approach to interactive video object segmentation.
Our network operates on superpixel graphs, which allows us to reduce the dimensionality of the problem by several orders of magnitude.
arXiv Detail & Related papers (2021-03-05T17:37:12Z)
- Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing networks (a generic squeeze-and-rescale sketch follows).
arXiv Detail & Related papers (2020-11-06T12:17:01Z)
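The squeeze step maps naturally to a pooled channel vector; the sketch below shows the generic squeeze-reason-rescale pattern (the paper's reasoning module differs in detail):

```python
# Pool the spatial map into one channel-wise vector, "reason" over it with a
# small MLP, then modulate the input; no spatial message passing is needed.
import torch
import torch.nn as nn

class SqueezeReasoningBlock(nn.Module):
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        self.reason = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, C, H, W)
        v = x.mean(dim=(2, 3))                # squeeze: spatial map -> channel vector
        w = self.reason(v)                    # reason in the compact vector space
        return x * w.unsqueeze(-1).unsqueeze(-1)  # expand back and rescale input
```

Because the block keeps input and output shapes identical, it can be dropped in after any convolutional stage, matching the plug-in claim above.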
- About Graph Degeneracy, Representation Learning and Scalability [2.029783382155471]
We present two techniques taking advantage of the K-Core Decomposition to reduce the time and memory consumption of walk-based Graph Representation Learning algorithms.
We evaluate the performance, in terms of embedding quality and computational resources, of the proposed techniques on several academic datasets (a small k-core illustration follows).
arXiv Detail & Related papers (2020-09-04T09:39:43Z)
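A small illustration of the general recipe (the embedding step itself is left abstract; nx.k_core is standard networkx, not the paper's code):

```python
# Restrict walk-based embedding to a dense k-core: iteratively prune nodes of
# degree < k, then run walks only on the surviving subgraph.
import networkx as nx

def core_reduced_graph(g: nx.Graph, k: int = 2) -> nx.Graph:
    g = g.copy()
    g.remove_edges_from(nx.selfloop_edges(g))  # k_core requires no self-loops
    return nx.k_core(g, k=k)

g = nx.karate_club_graph()
core = core_reduced_graph(g, k=3)
print(len(g), "->", len(core), "nodes to embed")  # walks now cover fewer nodes
```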
- SIGN: Scalable Inception Graph Neural Networks [4.5158585619109495]
We propose a new, efficient and scalable graph deep learning architecture that sidesteps the need for graph sampling.
Our architecture allows using different local graph operators to best suit the task at hand (a condensed sketch of this precomputation follows this entry).
We obtain state-of-the-art results on ogbn-papers100M, the largest public graph dataset, with over 110 million nodes and 1.5 billion edges.
arXiv Detail & Related papers (2020-04-23T14:46:10Z)
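A condensed sketch of the sampling-free idea as summarized above (a dense adjacency and its powers stand in for the paper's family of local operators):

```python
# Operator features [X, AX, A^2X, ...] are computed once as preprocessing,
# so training reduces to plain mini-batch MLP updates with no graph sampling.
import torch
import torch.nn as nn

def precompute_operator_features(adj, x, n_hops=2):
    """Concatenate [X, AX, ..., A^n X] for a row-normalized dense adjacency."""
    adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)
    feats, cur = [x], x
    for _ in range(n_hops):
        cur = adj @ cur                  # one more hop of neighborhood averaging
        feats.append(cur)
    return torch.cat(feats, dim=1)       # (n_nodes, (n_hops + 1) * d)

class SignStyleClassifier(nn.Module):
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, z):                # z: rows of precomputed operator features
        return self.net(z)
```

Because the operator features are fixed after preprocessing, nodes can be shuffled into ordinary mini-batches, which is how graph sampling is sidestepped at billion-edge scale.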