Temporal Saliency Query Network for Efficient Video Recognition
- URL: http://arxiv.org/abs/2207.10379v1
- Date: Thu, 21 Jul 2022 09:23:34 GMT
- Title: Temporal Saliency Query Network for Efficient Video Recognition
- Authors: Boyang Xia, Zhihao Wang, Wenhao Wu, Haoran Wang, Jungong Han
- Abstract summary: Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select salient frames without awareness of class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
- Score: 82.52760040577864
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Efficient video recognition is a hot-spot research topic with the explosive
growth of multimedia data on the Internet and mobile devices. Most existing
methods select salient frames without awareness of class-specific saliency
scores, which neglects the implicit association between a frame's saliency and
the category it belongs to. To alleviate this issue, we devise a novel Temporal
Saliency Query (TSQ) mechanism, which introduces class-specific information to
provide fine-grained cues for saliency measurement. Specifically, we model the
class-specific saliency measuring process as a query-response task: for each
category, its common pattern is employed as the query, and the most salient
frames respond to it. The resulting query-frame similarities are then adopted
as the frame saliency scores. To achieve this, we
propose a Temporal Saliency Query Network (TSQNet) that includes two
instantiations of the TSQ mechanism based on visual appearance similarities and
textual event-object relations. Afterward, cross-modality interactions are
imposed to promote information exchange between them. Finally, we use the
class-specific saliencies of the most confident categories generated by the two
modalities to select the salient frames. Extensive experiments
demonstrate the effectiveness of our method by achieving state-of-the-art
results on ActivityNet, FCVID and Mini-Kinetics datasets. Our project page is
at https://lawrencexia2008.github.io/projects/tsqnet .
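As a rough illustration of the query-response idea above (a minimal sketch, not the authors' released code: the cosine-similarity choice, tensor shapes, and function names below are assumptions), class-specific saliency can be computed by matching per-class query embeddings against per-frame features, then keeping only the scores of the most confident categories when picking frames:

```python
import torch
import torch.nn.functional as F

def tsq_saliency(frame_feats, class_queries, class_logits, top_classes=3, top_frames=8):
    """Sketch of Temporal Saliency Query-style frame selection (illustrative only).

    frame_feats:   (T, D) per-frame features from a lightweight backbone
    class_queries: (C, D) per-class query embeddings (the "common patterns")
    class_logits:  (C,)   preliminary video-level class confidences
    Returns the indices of the frames judged most salient.
    """
    # Query-response: similarity between every class query and every frame.
    sim = F.cosine_similarity(
        class_queries.unsqueeze(1),   # (C, 1, D)
        frame_feats.unsqueeze(0),     # (1, T, D)
        dim=-1,
    )                                 # (C, T) class-specific saliency scores

    # Keep only the saliencies of the most confident categories.
    conf_cls = class_logits.topk(top_classes).indices   # (top_classes,)
    saliency = sim[conf_cls].max(dim=0).values          # (T,) per-frame score

    # Hand the most salient frames to the heavy recognition network.
    return saliency.topk(min(top_frames, saliency.numel())).indices

# Toy usage with random tensors.
T, C, D = 32, 200, 256
salient_idx = tsq_saliency(torch.randn(T, D), torch.randn(C, D), torch.randn(C))
print(salient_idx)
```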
Related papers
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video. Fewer than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism [8.395400675921515]
Gated-ViGAT is an efficient approach for video event recognition.
It uses bottom-up (object) information, a new frame sampling policy and a gating mechanism.
Gated-ViGAT provides a large computational complexity reduction in comparison to our previous approach.
arXiv Detail & Related papers (2023-01-18T14:36:22Z)
- What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection [4.726777092009554]
Existing methods for Video Visual Relation Detection (VidVRD) fall into two categories: segment-based and window-based.
We first point out the limitations of these two approaches and propose the Temporal Span Proposal Network (TSPN), a novel method with advantages in both efficiency and effectiveness.
arXiv Detail & Related papers (2021-07-15T07:01:26Z)
- Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
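The query-response idea in this related paper differs from TSQNet's saliency use: each learned query attends to the video and is classified against its own response label set. Below is a minimal sketch of that pattern (hypothetical module name, attention layer, and label-set sizes; not the TQN implementation):

```python
import torch
import torch.nn as nn

class QueryResponseHead(nn.Module):
    """Illustrative only: each learned query attends over frame features and
    is decoded against its own response label set (sizes are made up)."""

    def __init__(self, dim=256, response_sizes=(5, 9, 3)):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(len(response_sizes), dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # One classifier per query, each with its own label set.
        self.heads = nn.ModuleList(nn.Linear(dim, n) for n in response_sizes)

    def forward(self, frame_feats):                    # frame_feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        responses, _ = self.attn(q, frame_feats, frame_feats)   # (B, Q, dim)
        return [head(responses[:, i]) for i, head in enumerate(self.heads)]

# Toy usage: one logit tensor per query / response label set.
logits = QueryResponseHead()(torch.randn(2, 16, 256))
print([l.shape for l in logits])
```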
arXiv Detail & Related papers (2021-04-19T17:58:48Z)
- BriNet: Towards Bridging the Intra-class and Inter-class Gaps in One-Shot Segmentation [84.2925550033094]
Few-shot segmentation focuses on the generalization of models to segment unseen object instances with limited training samples.
We propose a framework, BriNet, to bridge the gaps between the extracted features of the query and support images.
Experimental results demonstrate the effectiveness of our framework, which outperforms other competitive methods.
arXiv Detail & Related papers (2020-08-14T07:45:50Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- Evaluating Temporal Queries Over Video Feeds [25.04363138106074]
Temporal queries involving objects and their co-occurrences in video feeds are of interest to many applications ranging from law enforcement to security and safety.
We present an architecture consisting of three layers, namely object detection/tracking, intermediate data generation and query evaluation.
We propose two techniques, MFS and SSG, to organize all detected objects in the intermediate data generation layer.
We also introduce an algorithm called State Traversal (ST) that processes incoming frames against the SSG and efficiently prunes objects and frames unrelated to query evaluation.
arXiv Detail & Related papers (2020-03-02T14:55:57Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)