VRAG: Region Attention Graphs for Content-Based Video Retrieval
- URL: http://arxiv.org/abs/2205.09068v1
- Date: Wed, 18 May 2022 16:50:45 GMT
- Title: VRAG: Region Attention Graphs for Content-Based Video Retrieval
- Authors: Kennard Ng, Ser-Nam Lim, Gim Hee Lee
- Abstract summary: Video Region Attention Graph Networks (VRAG) improve on state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes spatio-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
- Score: 85.54923500208041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Content-based Video Retrieval (CBVR) is used on media-sharing platforms for
applications such as video recommendation and filtering. To manage databases
that scale to billions of videos, video-level approaches that use fixed-size
embeddings are preferred due to their efficiency. In this paper, we introduce
Video Region Attention Graph Networks (VRAG), which improve the state of the art
of video-level methods. We represent videos at a finer granularity via
region-level features and encode video spatio-temporal dynamics through
region-level relations. Our VRAG captures the relationships between regions
based on their semantic content via self-attention and the permutation
invariant aggregation of Graph Convolution. In addition, we show that the
performance gap between video-level and frame-level methods can be reduced by
segmenting videos into shots and using shot embeddings for video retrieval. We
evaluate our VRAG over several video retrieval tasks and achieve a new
state-of-the-art for video-level retrieval. Furthermore, our shot-level VRAG
shows higher retrieval precision than other existing video-level methods, and
closer performance to frame-level methods at faster evaluation speeds. Finally,
our code will be made publicly available.
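The abstract describes the core mechanism only at a high level, so the sketch below illustrates one plausible reading of it: self-attention between region-level features yields a soft adjacency over regions, and a graph-convolution step aggregates them, permutation-invariantly, into a single fixed-size embedding suitable for video-level retrieval. Everything here (the class name `RegionGraphEncoder`, the layer widths, the mean-pooling readout, and the cosine scoring) is an illustrative assumption for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionGraphEncoder(nn.Module):
    """Hypothetical sketch of a region-attention graph encoder.

    Input: a set of region-level features from one video (or one shot).
    Output: one L2-normalized, fixed-size embedding, so large databases
    can be searched with simple dot products.
    """

    def __init__(self, region_dim: int = 512, embed_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(region_dim, embed_dim)  # attention query projection
        self.key = nn.Linear(region_dim, embed_dim)    # attention key projection
        self.gcn = nn.Linear(region_dim, embed_dim)    # shared graph-convolution weight

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, region_dim); the ordering of regions is arbitrary
        q, k = self.query(regions), self.key(regions)
        # Self-attention between regions defines a soft adjacency matrix,
        # i.e. how strongly each region relates to every other region.
        adj = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        # One graph-convolution step: propagate features along the soft
        # edges, then transform and apply a non-linearity.
        nodes = F.relu(self.gcn(adj @ regions))
        # Permutation-invariant readout (mean pooling) -> fixed-size vector.
        return F.normalize(nodes.mean(dim=0), dim=-1)


# Usage sketch: score two videos by the similarity of their embeddings.
encoder = RegionGraphEncoder()
video_a = encoder(torch.randn(200, 512))  # e.g. 200 region features from video A
video_b = encoder(torch.randn(150, 512))  # e.g. 150 region features from video B
similarity = torch.dot(video_a, video_b)  # retrieval score (embeddings are normalized)
```

For the shot-level variant mentioned in the abstract, the same encoder would presumably be applied to the regions of each shot separately, and a query video scored against a database video by comparing their sets of shot embeddings (for example via the best-matching shot pair); this is how the gap to frame-level methods can be narrowed while keeping evaluation fast.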
Related papers
- EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval [52.375143786641196]
EgoCVR is an evaluation benchmark for fine-grained Composed Video Retrieval.
EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding.
arXiv Detail & Related papers (2024-07-23T17:19:23Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- SB-VQA: A Stack-Based Video Quality Assessment Framework for Video Enhancement [0.40777876591043155]
We propose a stack-based framework for video quality assessment (VQA) that outperforms existing state-of-the-art methods on enhanced videos.
In addition to proposing the VQA framework for enhanced videos, we also investigate its application to professionally generated content (PGC).
Our experiments demonstrate that existing VQA algorithms can be applied to PGC videos, and we find that VQA performance for PGC videos can be improved by considering the plot of a play.
arXiv Detail & Related papers (2023-05-15T07:44:10Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been devoted to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)