Video Imprint
- URL: http://arxiv.org/abs/2106.03283v1
- Date: Mon, 7 Jun 2021 00:32:47 GMT
- Title: Video Imprint
- Authors: Zhanning Gao, Le Wang, Nebojsa Jojic, Zhenxing Niu, Nanning Zheng,
Gang Hua
- Abstract summary: A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting.
The proposed video imprint representation exploits temporal correlations among image features across video frames.
The video imprint is fed into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval tasks, respectively.
- Score: 107.1365846180187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A new unified video analytics framework (ER3) is proposed for complex event
retrieval, recognition and recounting, based on the proposed video imprint
representation, which exploits temporal correlations among image features
across video frames. With the video imprint representation, it is convenient to
reverse-map back to both temporal and spatial locations in the video frames,
allowing for both key frame identification and key area localization within
each frame. In the proposed framework, a dedicated feature alignment module is
incorporated for redundancy removal across frames to produce the tensor
representation, i.e., the video imprint. Subsequently, the video imprint is
individually fed into both a reasoning network and a feature aggregation
module, for event recognition/recounting and event retrieval tasks,
respectively. Thanks to its attention mechanism inspired by the memory networks
used in language modeling, the proposed reasoning network is capable of
simultaneous event category recognition and localization of the key pieces of
evidence for event recounting. In addition, the latent structure in our
reasoning network highlights the areas of the video imprint, which can be
directly used for event recounting. For the event retrieval task, the compact
video representation aggregated from the video imprint contributes to better
retrieval results than existing state-of-the-art methods.
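Below is a minimal, illustrative PyTorch-style sketch of the flow the abstract describes: a feature alignment step that collapses redundant per-frame features into an imprint tensor, a memory-style reasoning head for recognition/recounting, and a compact aggregation for retrieval. All module names (FeatureAlignment, ReasoningNetwork), tensor shapes, and sizes are assumptions for illustration; this is not the authors' implementation of ER3.

```python
# Illustrative sketch only (not the authors' ER3 code): minimal stand-ins for
# the feature alignment module, the video imprint, the memory-style reasoning
# network, and the retrieval aggregation described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignment(nn.Module):
    """Softly assign per-frame local features to shared 'imprint cells',
    collapsing redundancy across frames into one compact tensor (the imprint)."""

    def __init__(self, feat_dim=512, imprint_cells=64):
        super().__init__()
        self.cells = nn.Parameter(torch.randn(imprint_cells, feat_dim))

    def forward(self, frame_feats):
        # frame_feats: (T, N, D) = frames x spatial locations x feature dim
        T, N, D = frame_feats.shape
        flat = frame_feats.reshape(T * N, D)
        assign = F.softmax(flat @ self.cells.t(), dim=-1)                     # (T*N, C)
        imprint = (assign.t() @ flat) / (assign.sum(0).unsqueeze(-1) + 1e-6)  # (C, D)
        # 'assign' keeps the reverse map from imprint cells back to (frame, location).
        return imprint, assign.reshape(T, N, -1)


class ReasoningNetwork(nn.Module):
    """Memory-network-style attention over imprint cells; the attention weights
    serve as evidence for recounting, the attended summary for recognition."""

    def __init__(self, feat_dim=512, num_events=20):
        super().__init__()
        self.query = nn.Parameter(torch.randn(feat_dim))
        self.classifier = nn.Linear(feat_dim, num_events)

    def forward(self, imprint):
        attn = F.softmax(imprint @ self.query, dim=0)   # (C,) attention over imprint cells
        summary = attn @ imprint                        # (D,) attended video summary
        return self.classifier(summary), attn


def aggregate_for_retrieval(imprint):
    """Compact, L2-normalised video descriptor for the retrieval task."""
    return F.normalize(imprint.mean(dim=0), dim=0)


if __name__ == "__main__":
    feats = torch.randn(30, 49, 512)            # 30 frames, a 7x7 grid of 512-d features
    imprint, assign = FeatureAlignment()(feats)
    logits, evidence = ReasoningNetwork()(imprint)
    descriptor = aggregate_for_retrieval(imprint)
    # High-attention cells in 'evidence' can be traced through 'assign' back to
    # specific frames and spatial locations for event recounting.
    print(logits.shape, evidence.shape, descriptor.shape)
```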
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information to handle the temporally correlated nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Video Captioning in Compressed Video [1.953018353016675]
We propose a video captioning method which operates directly on the stored compressed videos.
To learn a discriminative visual representation for video captioning, we design a residuals-assisted encoder (RAE), which spots regions of interest in I-frames.
We evaluate our method on two benchmark datasets and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-01-02T03:06:03Z)
- Transforming Multi-Concept Attention into Video Summarization [36.85535624026879]
We propose a novel attention-based framework for video summarization with complex video data.
Our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications.
arXiv Detail & Related papers (2020-06-02T06:23:50Z)
- Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching [0.0]
We propose an architecture for near-duplicate video detection based on index and query signature-based structures integrating temporal and perceptual visual features.
For matching, we propose to instantiate a retrieval model based on logical inference through the coupling of an N-gram sliding window process and theoretically-sound lattice-based structures.
arXiv Detail & Related papers (2020-05-15T04:45:52Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot (an illustrative query-aware attention sketch follows this list).
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
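The CHAN entry above describes query-aware attention over per-shot features; the following is a generic, illustrative sketch of that idea. The module name, shapes, and per-shot scoring head are assumptions for illustration, not the paper's actual architecture.

```python
# Generic query-aware attention over shot features: an illustrative sketch,
# not CHAN's actual implementation; names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryAwareAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-shot relevance head (assumed)

    def forward(self, shot_feats, query_emb):
        # shot_feats: (S, D) visual features per shot; query_emb: (D,) encoded text query.
        weights = F.softmax(shot_feats @ query_emb, dim=0)      # query-conditioned attention over shots
        context = weights.unsqueeze(-1) * shot_feats            # re-weight shots by query relevance
        return torch.sigmoid(self.score(context)).squeeze(-1)   # per-shot inclusion scores for the summary


scores = QueryAwareAttention()(torch.randn(12, 256), torch.randn(256))
print(scores.shape)   # torch.Size([12]): relevance of each of 12 shots to the query
```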