Temporal Query Networks for Fine-grained Video Understanding
- URL: http://arxiv.org/abs/2104.09496v1
- Date: Mon, 19 Apr 2021 17:58:48 GMT
- Title: Temporal Query Networks for Fine-grained Video Understanding
- Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
- Abstract summary: We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
- Score: 88.9877174286279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our objective in this work is fine-grained classification of actions in
untrimmed videos, where the actions may be temporally extended or may span only
a few frames of the video. We cast this into a query-response mechanism, where
each query addresses a particular question, and has its own response label set.
We make the following four contributions: (i) We propose a new model - a
Temporal Query Network (TQN) - which enables the query-response functionality, and a
structural understanding of fine-grained actions. It attends to relevant
segments for each query with a temporal attention mechanism, and can be trained
using only the labels for each query. (ii) We propose a new way - stochastic
feature bank update - to train a network on videos of various lengths with the
dense sampling required to respond to fine-grained queries. (iii) We compare
the TQN to other architectures and text supervision methods, and analyze their
pros and cons. Finally, (iv) we evaluate the method extensively on the FineGym
and Diving48 benchmarks for fine-grained action classification and surpass the
state-of-the-art using only RGB features.
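The query-response mechanism of contribution (i) can be sketched as follows: each query is a learned vector that attends over dense per-frame features, and each query has its own classifier over its own response label set. This is a minimal numpy illustration with assumed shapes and plain dot-product attention; the function and variable names are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_response(frame_feats, queries, heads):
    """Answer each query by attending over the video's frame features.

    frame_feats: (T, D) dense per-frame (or per-clip) features.
    queries:     (Q, D) one learned vector per query, e.g. "which apparatus?".
    heads:       list of (D, C_q) classifier weights -- each query has its
                 own response label set, so C_q may differ per query.
    Returns a list of Q probability vectors, one per query.
    """
    responses = []
    for q, W in zip(queries, heads):
        # temporal attention: which frames are relevant to this query?
        attn = softmax(frame_feats @ q / np.sqrt(len(q)))   # (T,)
        context = attn @ frame_feats                        # (D,) query-specific summary
        responses.append(softmax(context @ W))              # distribution over this query's labels
    return responses
```

Because each query pools its own attention-weighted summary, supervision needs only the per-query labels, matching the training setup described in the abstract.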
Related papers
- Temporal-aware Hierarchical Mask Classification for Video Semantic
Segmentation [62.275143240798236]
Video semantic segmentation dataset has limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Query-Guided Networks for Few-shot Fine-grained Classification and
Person Search [93.80556485668731]
Few-shot fine-grained classification and person search appear as distinct tasks and literature has treated them separately.
We propose a novel unified Query-Guided Network (QGN) applicable to both tasks.
QGN improves on a few recent few-shot fine-grained datasets, outperforming other techniques on CUB by a large margin.
arXiv Detail & Related papers (2022-09-21T10:25:32Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is an active research topic, driven by the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action
Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- Evaluating Temporal Queries Over Video Feeds [25.04363138106074]
Temporal queries involving objects and their co-occurrences in video feeds are of interest to many applications ranging from law enforcement to security and safety.
We present an architecture consisting of three layers, namely object detection/tracking, intermediate data generation and query evaluation.
We propose two techniques, MFS and SSG, to organize all detected objects in the intermediate data generation layer.
We also introduce an algorithm called State Traversal (ST) that processes incoming frames against the SSG and efficiently prunes objects and frames unrelated to query evaluation.
arXiv Detail & Related papers (2020-03-02T14:55:57Z)
- Video Monitoring Queries [16.7214343633499]
We study the problem of interactive declarative query processing on video streams.
We introduce a set of approximate filters to speed up queries that involve objects of specific type.
The filters are able to assess quickly if the query predicates are true to proceed with further analysis of the frame.
arXiv Detail & Related papers (2020-02-24T20:53:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.