Graph Neural Network for Video Relocalization
- URL: http://arxiv.org/abs/2007.09877v2
- Date: Wed, 26 Jan 2022 08:06:22 GMT
- Title: Graph Neural Network for Video Relocalization
- Authors: Yuan Zhou, Mingfei Wang, Ruolin Wang, Shuwei Huo
- Abstract summary: We find that in video relocalization datasets, there is no consistent relationship between feature similarity computed by frame and feature similarity computed by video.
Taking this phenomenon into account, in this article we treat video features as a graph by concatenating the query video feature and the proposal video feature along the time dimension.
With the power of graph neural networks, we propose a Multi-Graph Feature Fusion Module to fuse the relation features of this graph.
- Score: 16.67309677191578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on the video relocalization task, which uses a
query video clip as input to retrieve a semantically related video clip in
another untrimmed long video. We find that in video relocalization datasets,
there is no consistent relationship between feature similarity computed by
frame and feature similarity computed by video, which affects feature fusion
among frames. However, existing video relocalization methods do not fully
consider this phenomenon. Taking it into account, we treat video features as a
graph by concatenating the query video feature and the proposal video feature
along the time dimension, where each timestep is treated as a node and each row
of the feature matrix is treated as that node's feature. Then, with the power
of graph neural networks, we propose a Multi-Graph Feature Fusion Module to
fuse the relation features of this graph. Evaluations on the ActivityNet v1.2
and Thumos14 datasets show that our proposed method outperforms
state-of-the-art methods.
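To make the graph construction concrete, below is a minimal PyTorch sketch of the idea described in the abstract: query and proposal features are concatenated along the time dimension, each timestep becomes a node, and one message-passing step fuses features across nodes. The similarity-based adjacency, the single fusion layer, and all names and sizes here are illustrative assumptions, not the authors' exact Multi-Graph Feature Fusion Module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphFeatureFusion(nn.Module):
    """Illustrative single-graph fusion step (an assumption, not the
    paper's exact Multi-Graph Feature Fusion Module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # transforms aggregated neighbor features

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (T_q + T_p, dim) -- one row per timestep (one graph node).
        # Build a dense adjacency from pairwise feature similarity,
        # normalized with softmax so each node's incoming edges sum to 1.
        sim = nodes @ nodes.t()                          # (N, N) similarity scores
        adj = F.softmax(sim / nodes.size(1) ** 0.5, dim=-1)
        # Message passing: aggregate neighbor features, project, add residual.
        return nodes + F.relu(self.proj(adj @ nodes))

# Usage: concatenate query and proposal features along the time dimension.
T_q, T_p, dim = 32, 48, 256            # hypothetical clip lengths / feature size
query = torch.randn(T_q, dim)          # per-frame query video features
proposal = torch.randn(T_p, dim)       # per-frame proposal video features
graph_nodes = torch.cat([query, proposal], dim=0)  # (T_q + T_p, dim)

fusion = GraphFeatureFusion(dim)
out = fusion(graph_nodes)              # (T_q + T_p, dim) relation-fused features
```

The "Multi-Graph" in the module's name suggests several such graphs are built and fused; repeating this step with differently constructed adjacencies would be one plausible way to extend the sketch.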
Related papers
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way captures long-range interactions among video frames, and its sparsity keeps the model from hitting memory and compute bottlenecks during training (a minimal sketch of this idea follows this entry).
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
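As a rough illustration of the sparse frame graph mentioned in the VideoSAGE summary, the sketch below connects each frame node only to a small temporal window plus a few random long-range edges, keeping the edge count roughly linear in the number of frames rather than quadratic. The specific connectivity rule (local window plus random skips) is an assumption for illustration, not the paper's actual construction.

```python
import random

def sparse_frame_graph(num_frames: int, window: int = 2, long_range: int = 2,
                       seed: int = 0) -> list[tuple[int, int]]:
    """Build a sparse edge list over video frames (hypothetical scheme:
    a local temporal window plus a few random long-range edges per node)."""
    rng = random.Random(seed)
    edges = set()
    for i in range(num_frames):
        # Local edges capture short-range temporal continuity.
        for j in range(i + 1, min(i + window + 1, num_frames)):
            edges.add((i, j))
        # A few random long-range edges let distant frames interact
        # without the O(N^2) cost of a fully connected graph.
        for _ in range(long_range):
            j = rng.randrange(num_frames)
            if j != i:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)

print(len(sparse_frame_graph(100)))  # edge count grows ~linearly with frames
```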
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called the Visual Commonsense-aware Representation Network (VCRN), for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method that is on par with the performance of its offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model is trained end-to-end and achieves state-of-the-art performance on the YouTube-VIS dataset.
arXiv Detail & Related papers (2022-10-30T10:01:01Z)
- Memory Efficient Temporal & Visual Graph Model for Unsupervised Video Domain Adaptation [50.158454960223274]
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos.
We propose a memory-efficient graph-based video DA approach.
arXiv Detail & Related papers (2022-08-13T02:56:10Z)
- Video and Text Matching with Conditioned Embeddings [81.81028089100727]
We present a method for matching a text sentence from a given corpus to a given video clip and vice versa.
In this work, we encode the data in a way that takes the query's relevant information into account.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX.
arXiv Detail & Related papers (2021-10-21T17:31:50Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
- SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose graph modeling networks for video summarization, termed SumGraph, to represent relations among video frames as a graph.
We achieve state-of-the-art performance on several video summarization benchmarks in both supervised and unsupervised settings.
arXiv Detail & Related papers (2020-07-17T08:11:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.