Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural
Architecture Search
- URL: http://arxiv.org/abs/2112.04710v1
- Date: Thu, 9 Dec 2021 05:40:33 GMT
- Title: Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural
Architecture Search
- Authors: Yifan Jiang, Xinyu Gong, Junru Wu, Humphrey Shi, Zhicheng Yan,
Zhangyang Wang
- Abstract summary: The X3D work presents a new family of efficient video models by expanding a hand-crafted image architecture along multiple axes.
A probabilistic neural architecture search method is adopted to search efficiently in such a large space.
Evaluations on the Kinetics and Something-Something-V2 benchmarks confirm that our AutoX3D models outperform existing ones by up to 1.3% in accuracy under similar FLOPs.
- Score: 73.05693037548932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient video architecture is the key to deploying video recognition
systems on devices with limited computing resources. Unfortunately, existing
video architectures are often computationally intensive and not suitable for
such applications. The recent X3D work presents a new family of efficient video
models by expanding a hand-crafted image architecture along multiple axes, such
as space, time, width, and depth. Although operating in a conceptually large
space, X3D searches one axis at a time and explores only a small set of 30
architectures in total, which does not sufficiently cover the space. This
paper bypasses existing 2D architectures and directly searches for 3D
architectures in a fine-grained space, where block type, filter number,
expansion ratio, and attention block are jointly searched. A probabilistic
neural architecture search method is adopted to search efficiently in such a
large space. Evaluations on the Kinetics and Something-Something-V2 benchmarks
confirm that our AutoX3D models outperform existing ones by up to 1.3% in
accuracy under similar FLOPs, and reduce the computational cost by up to 1.74x
when reaching similar performance.
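The abstract names four jointly searched per-block decisions (block type, filter number, expansion ratio, attention block) explored with a probabilistic NAS method. The sketch below illustrates that general idea only: the concrete choice values, the number of blocks, and the REINFORCE-style logit update are all illustrative assumptions, not details from the paper.

```python
import math
import random

# Hypothetical fine-grained per-block search space, loosely following the
# decision axes named in the abstract. The candidate values are invented
# for illustration and do not come from the paper.
SEARCH_SPACE = {
    "block_type": ["3d_conv", "2plus1d_conv", "channelwise_3d"],
    "filters": [24, 48, 96, 192],
    "expansion_ratio": [2.25, 3.0, 4.0],
    "attention": [False, True],
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ProbabilisticSearcher:
    """Minimal probabilistic-NAS sketch: each (block, decision) pair keeps a
    learnable logit vector; architectures are sampled from the induced
    distributions, and logits of sampled choices are reinforced by reward."""

    def __init__(self, space, num_blocks=4, seed=0):
        self.space = space
        self.num_blocks = num_blocks
        self.rng = random.Random(seed)
        # One logit vector per (block, decision), initialised uniform.
        self.logits = {
            (b, k): [0.0] * len(v)
            for b in range(num_blocks)
            for k, v in space.items()
        }

    def sample(self):
        arch = []
        for b in range(self.num_blocks):
            block = {}
            for k, choices in self.space.items():
                probs = softmax(self.logits[(b, k)])
                idx = self.rng.choices(range(len(choices)), weights=probs)[0]
                block[k] = choices[idx]
            arch.append(block)
        return arch

    def update(self, arch, reward, lr=0.5):
        # REINFORCE-style update: push up the logits of the sampled choices
        # in proportion to the (validation) reward.
        for b, block in enumerate(arch):
            for k, v in block.items():
                idx = self.space[k].index(v)
                self.logits[(b, k)][idx] += lr * reward

searcher = ProbabilisticSearcher(SEARCH_SPACE)
arch = searcher.sample()
searcher.update(arch, reward=1.0)  # reward would come from proxy training
```

In a real search, the reward would be a validation metric from briefly training the sampled network; here it is a placeholder scalar.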
Related papers
- TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding [74.033589504806]
We propose an efficient multi-level convolution architecture for 3D visual grounding.
Our method achieves the top inference speed and surpasses the previous fastest method by 100% in FPS.
arXiv Detail & Related papers (2025-02-14T18:59:59Z)
- Flexible Channel Dimensions for Differentiable Architecture Search [50.33956216274694]
We propose a novel differentiable neural architecture search method with an efficient dynamic channel allocation algorithm.
We show that the proposed framework is able to find DNN architectures that are equivalent to previous methods in task accuracy and inference latency.
arXiv Detail & Related papers (2023-06-13T15:21:38Z)
- Searching a High-Performance Feature Extractor for Text Recognition Network [92.12492627169108]
We design a domain-specific search space by exploring principles for having good feature extractors.
As the space is huge and complexly structured, no existing NAS algorithms can be applied.
We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z)
- PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [26.059213743430192]
We study 3D deep learning from the efficiency perspective.
We propose a novel hardware-efficient 3D primitive, Point-Voxel Convolution (PVConv)
arXiv Detail & Related papers (2022-04-25T17:13:55Z)
- Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search [84.4140192638394]
Most differentiable neural architecture search methods construct a super-net for search and derive a target-net as its sub-graph for evaluation.
In this paper, we introduce EnTranNAS that is composed of Engine-cells and Transit-cells.
Our method also spares much memory and computation cost, which speeds up the search process.
arXiv Detail & Related papers (2021-01-27T12:16:47Z)
- Memory-Efficient Hierarchical Neural Architecture Search for Image Restoration [68.6505473346005]
We propose HiNAS, a memory-efficient hierarchical NAS framework for image denoising and image super-resolution tasks.
With a single GTX1080Ti GPU, it takes only about 1 hour to search for the denoising network on BSD500 and 3.5 hours to search for the super-resolution structure on DIV2K.
arXiv Detail & Related papers (2020-12-24T12:06:17Z)
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms the existing state of the art by a large margin on the DAVIS'16 Unsupervised, FBMS, and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution [34.713667358316286]
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely.
Existing 3D perception models are not able to recognize small instances very well due to the low-resolution voxelization and aggressive downsampling.
We propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch.
arXiv Detail & Related papers (2020-07-31T14:27:27Z)
- X3D: Expanding Architectures for Efficient Video Recognition [21.539880641349693]
X3D is a family of efficient video networks that progressively expand a tiny 2D image classification architecture.
Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed.
We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks.
arXiv Detail & Related papers (2020-04-09T17:59:47Z)
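The X3D entry above describes a stepwise expansion procedure: starting from a tiny network, one axis is expanded at a time and the best trade-off is kept. The sketch below illustrates that greedy loop only; the axis names, the toy FLOPs model, and the diminishing-returns accuracy proxy are stand-ins I invented for illustration, not the paper's actual cost model or metric.

```python
import math

# Illustrative expansion axes in the spirit of X3D (temporal length,
# spatial resolution, width, depth); names and values are assumptions.
AXES = ("frames", "resolution", "width", "depth")

def flops(cfg):
    # Toy cost model: compute grows with every axis, quadratically in
    # spatial resolution.
    return cfg["frames"] * cfg["resolution"] ** 2 * cfg["width"] * cfg["depth"]

def accuracy_proxy(cfg):
    # Toy diminishing-returns proxy standing in for validation accuracy.
    return sum(math.log(cfg[a]) for a in AXES)

def expand(cfg, axis, factor=2):
    out = dict(cfg)
    out[axis] *= factor
    return out

def stepwise_expand(base, budget):
    """Greedily expand one axis at a time, keeping the candidate with the
    best accuracy proxy that still fits the FLOPs budget."""
    cfg = dict(base)
    while True:
        candidates = [expand(cfg, a) for a in AXES]
        feasible = [c for c in candidates if flops(c) <= budget]
        if not feasible:
            return cfg  # no further expansion fits the budget
        cfg = max(feasible, key=accuracy_proxy)

base = {"frames": 1, "resolution": 32, "width": 8, "depth": 4}
final = stepwise_expand(base, budget=flops(base) * 64)
```

Each iteration multiplies the cost by at least the expansion factor, so the loop terminates once no single-axis expansion fits the budget.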
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.