Searching for Two-Stream Models in Multivariate Space for Video
Recognition
- URL: http://arxiv.org/abs/2108.12957v1
- Date: Mon, 30 Aug 2021 02:03:28 GMT
- Title: Searching for Two-Stream Models in Multivariate Space for Video
Recognition
- Authors: Xinyu Gong, Heng Wang, Zheng Shou, Matt Feiszli, Zhangyang Wang and
Zhicheng Yan
- Abstract summary: We present a pragmatic neural architecture search approach, which is able to search for two-stream video models in giant spaces efficiently.
We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space.
- Score: 80.25356538056839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional video models rely on a single stream to capture the complex
spatial-temporal features. Recent work on two-stream video models, such as
SlowFast network and AssembleNet, prescribe separate streams to learn
complementary features, and achieve stronger performance. However, manually
designing both streams as well as the in-between fusion blocks is a daunting
task, requiring the exploration of a tremendously large design space. Such manual
exploration is time-consuming and often ends up with sub-optimal architectures
when computational resources are limited and the exploration is insufficient.
In this work, we present a pragmatic neural architecture search approach, which
is able to search for two-stream video models in giant spaces efficiently. We
design a multivariate search space, including 6 search variables to capture a
wide variety of choices in designing two-stream models. Furthermore, we propose
a progressive search procedure, by searching for the architecture of individual
streams, fusion blocks, and attention blocks one after the other. We
demonstrate that two-stream models with significantly better performance can be
automatically discovered in our design space. Our searched two-stream models,
namely Auto-TSNet, consistently outperform other models on standard benchmarks.
On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces
FLOPs by nearly 11 times while achieving the same accuracy of 78.9%. On
Something-Something-V2, Auto-TSNet-M improves the accuracy by at least 2% over
other methods that use less than 50 GFLOPs per video.
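The abstract gives no code, so the following PyTorch-style sketch only illustrates the general shape of a two-stream video model with a lateral fusion block, of the kind the search space is said to cover. The class names, channel widths, kernel sizes, and the Kinetics-400 output head are illustrative assumptions, not the searched Auto-TSNet architecture.

```python
# Illustrative two-stream skeleton (not the authors' code): a heavy "slow"
# stream at a low frame rate, a light "fast" stream at a high frame rate,
# and a lateral fusion block that injects fast features into the slow path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Fuses fast-stream features into the slow stream via a 1x1x1 lateral conv."""
    def __init__(self, fast_ch, slow_ch):
        super().__init__()
        self.lateral = nn.Conv3d(fast_ch, slow_ch, kernel_size=1)

    def forward(self, slow, fast):
        # Resample the fast features to the slow stream's (T, H, W) resolution.
        fast = F.interpolate(fast, size=slow.shape[2:])
        return slow + self.lateral(fast)

class TwoStreamNet(nn.Module):
    def __init__(self, slow_ch=64, fast_ch=8, num_classes=400):
        super().__init__()
        self.slow = nn.Conv3d(3, slow_ch, (1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        self.fuse = FusionBlock(fast_ch, slow_ch)
        self.head = nn.Linear(slow_ch, num_classes)

    def forward(self, slow_clip, fast_clip):
        s = self.slow(slow_clip)        # few frames, wide channels
        f = self.fast(fast_clip)        # many frames, narrow channels
        s = self.fuse(s, f)
        return self.head(s.mean(dim=(2, 3, 4)))

# Example: 4 frames for the slow path, 16 frames for the fast path.
model = TwoStreamNet()
logits = model(torch.randn(1, 3, 4, 224, 224), torch.randn(1, 3, 16, 224, 224))
print(logits.shape)    # torch.Size([1, 400])
```

Under the progressive procedure described in the abstract, a search would fix the stream backbones found in an earlier stage before searching the fusion blocks, and then the attention blocks, rather than exploring all six search variables jointly.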
Related papers
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
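As a rough illustration of what "aggregate first, then interact" means in contrast to self-attention, here is a single-level, spatial-only sketch of focal modulation; it is a simplification for intuition only, and the class name and layer choices are assumptions rather than the Video-FocalNet implementation.

```python
# Simplified, single-level focal-modulation sketch (spatial only): local context
# is aggregated first (depthwise conv) and gated, then interacts with a per-pixel
# query projection by element-wise modulation. Not the Video-FocalNet code.
import torch
import torch.nn as nn

class TinyFocalModulation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, kernel_size=1)                            # query projection
        self.ctx = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)   # local aggregation
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        modulator = self.gate(x).sigmoid() * self.ctx(x)    # aggregate, then gate
        return self.proj(self.q(x) * modulator)             # modulate the query

y = TinyFocalModulation(32)(torch.randn(2, 32, 14, 14))     # (2, 32, 14, 14)
```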
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- TAPIR: Tracking Any Point with Per-frame Initialization and Temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
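The matching stage described above can be pictured with a tiny correlation-based toy: for a single query feature, independently pick the best-correlated location in every frame. The function name, random features, and omission of the feature extractor and learned refinement stage are all assumptions for illustration, not the TAPIR model.

```python
# Toy version of the per-frame matching stage: correlate one query feature with
# every spatial position of every frame and take the argmax location per frame.
import torch

def match_query(query_feat, frame_feats):
    """query_feat: (C,). frame_feats: (T, C, H, W). Returns (T, 2) integer coords."""
    T, C, H, W = frame_feats.shape
    corr = torch.einsum("c,tchw->thw", query_feat, frame_feats)    # (T, H, W)
    flat_idx = corr.flatten(1).argmax(dim=1)                       # best position per frame
    ys = torch.div(flat_idx, W, rounding_mode="floor")
    xs = flat_idx % W
    return torch.stack([ys, xs], dim=1)

# Example with random features: 8 frames, 64-dim features on a 32x32 grid.
coords = match_query(torch.randn(64), torch.randn(8, 64, 32, 32))
print(coords.shape)    # torch.Size([8, 2])
```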
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
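A minimal sketch of the separable-convolution idea: cheap depthwise 1-D filters applied along the temporal, height, and width axes of a video tensor and added residually. Channel splitting and the other details of the actual MVF module are omitted, so treat the class below as an illustrative assumption.

```python
# Depthwise 1-D convolutions along the three "views" (T, H, W) of a video
# tensor, combined residually. A simplification, not the MVFNet module.
import torch
import torch.nn as nn

class SeparableMultiView(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.t = nn.Conv3d(ch, ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), groups=ch)
        self.h = nn.Conv3d(ch, ch, kernel_size=(1, 3, 1), padding=(0, 1, 0), groups=ch)
        self.w = nn.Conv3d(ch, ch, kernel_size=(1, 1, 3), padding=(0, 0, 1), groups=ch)

    def forward(self, x):                        # x: (B, C, T, H, W)
        return x + self.t(x) + self.h(x) + self.w(x)

out = SeparableMultiView(16)(torch.randn(2, 16, 8, 56, 56))   # (2, 16, 8, 56, 56)
```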
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- PV-NAS: Practical Neural Architecture Search for Video Recognition [83.77236063613579]
Deep neural networks for video tasks are highly customized, and designing such networks requires domain experts and costly trial-and-error tests.
Recent advances in neural architecture search have boosted image recognition performance by a large margin.
In this study, we propose a practical solution, namely Practical Video Neural Architecture Search (PV-NAS).
arXiv Detail & Related papers (2020-11-02T08:50:23Z)
- Deep-n-Cheap: An Automated Search Framework for Low Complexity Deep Learning [3.479254848034425]
We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models.
Our framework is targeted for deployment on both benchmark and custom datasets.
Deep-n-Cheap includes a user-customizable complexity penalty which trades off performance with training time or number of parameters.
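The complexity penalty can be thought of as a scalarized objective like the toy below, where a user-chosen weight trades validation accuracy against model size (or training time). The functional form and parameter names here are assumptions, not Deep-n-Cheap's actual objective.

```python
# Toy complexity-penalized search objective: higher is better, and the
# penalty_weight lets a user trade accuracy against parameter count.
import math

def penalized_score(val_accuracy, num_params, penalty_weight=0.1, reference_params=1e6):
    """Illustrative objective; hypothetical form, not the framework's own."""
    return val_accuracy - penalty_weight * math.log10(num_params / reference_params + 1e-12)

print(penalized_score(0.92, 5e6, penalty_weight=0.05))   # small penalty for a 5M-param model
```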
arXiv Detail & Related papers (2020-03-27T13:00:21Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs; a minimal sketch of the channel-split idea appears after this list.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
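As referenced in the STH entry above, here is a minimal, hypothetical sketch of a channel-split spatio-temporal hybrid convolution: some channels receive a spatial 1x3x3 filter and the rest a temporal 3x1x1 filter, keeping the parameter count close to that of a plain 2D convolution. It illustrates the general idea only and is not the STH layer.

```python
# Channel-split hybrid convolution: spatial filters on most channels,
# temporal filters on the rest. A simplified illustration, not the STH code.
import torch
import torch.nn as nn

class HybridSTConv(nn.Module):
    def __init__(self, ch, spatial_ratio=0.75):
        super().__init__()
        self.sp_ch = int(ch * spatial_ratio)
        self.spatial = nn.Conv3d(self.sp_ch, self.sp_ch, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(ch - self.sp_ch, ch - self.sp_ch, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                                   # x: (B, C, T, H, W)
        xs, xt = x[:, :self.sp_ch], x[:, self.sp_ch:]
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

y = HybridSTConv(32)(torch.randn(2, 32, 8, 56, 56))         # (2, 32, 8, 56, 56)
```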
This list is automatically generated from the titles and abstracts of the papers on this site.