Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural
Architecture Search
- URL: http://arxiv.org/abs/2112.04710v1
- Date: Thu, 9 Dec 2021 05:40:33 GMT
- Title: Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural
Architecture Search
- Authors: Yifan Jiang, Xinyu Gong, Junru Wu, Humphrey Shi, Zhicheng Yan,
Zhangyang Wang
- Abstract summary: The X3D work presents a new family of efficient video models by expanding a hand-crafted image architecture along multiple axes.
A probabilistic neural architecture search method is adopted to search efficiently in such a large space.
Evaluations on the Kinetics and Something-Something-V2 benchmarks confirm that our AutoX3D models outperform existing ones by up to 1.3% in accuracy under similar FLOPs.
- Score: 73.05693037548932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient video architecture is the key to deploying video recognition
systems on devices with limited computing resources. Unfortunately, existing
video architectures are often computationally intensive and not suitable for
such applications. The recent X3D work presents a new family of efficient video
models by expanding a hand-crafted image architecture along multiple axes, such
as space, time, width, and depth. Although operating in a conceptually large
space, X3D searches one axis at a time and explores only a small set of 30
architectures in total, which does not sufficiently cover the space. This
paper bypasses existing 2D architectures and directly searches for 3D
architectures in a fine-grained space, where block type, filter number,
expansion ratio, and attention block are jointly searched. A probabilistic
neural architecture search method is adopted to search efficiently in such a
large space. Evaluations on the Kinetics and Something-Something-V2 benchmarks
confirm that our AutoX3D models outperform existing ones by up to 1.3% in
accuracy under similar FLOPs, and reduce the computational cost by up to 1.74x
when reaching similar performance.
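The abstract names four jointly searched per-block decisions (block type, filter number, expansion ratio, attention block) explored with a probabilistic NAS method. The sketch below illustrates that general idea only: the concrete choice values, the number of blocks, and the REINFORCE-style logit update are all illustrative assumptions, not details from the paper.

```python
import math
import random

# Hypothetical fine-grained per-block search space, loosely following the
# decision axes named in the abstract. The candidate values are invented
# for illustration and do not come from the paper.
SEARCH_SPACE = {
    "block_type": ["3d_conv", "2plus1d_conv", "channelwise_3d"],
    "filters": [24, 48, 96, 192],
    "expansion_ratio": [2.25, 3.0, 4.0],
    "attention": [False, True],
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ProbabilisticSearcher:
    """Minimal probabilistic-NAS sketch: each (block, decision) pair keeps a
    learnable logit vector; architectures are sampled from the induced
    distributions, and logits of sampled choices are reinforced by reward."""

    def __init__(self, space, num_blocks=4, seed=0):
        self.space = space
        self.num_blocks = num_blocks
        self.rng = random.Random(seed)
        # One logit vector per (block, decision), initialised uniform.
        self.logits = {
            (b, k): [0.0] * len(v)
            for b in range(num_blocks)
            for k, v in space.items()
        }

    def sample(self):
        arch = []
        for b in range(self.num_blocks):
            block = {}
            for k, choices in self.space.items():
                probs = softmax(self.logits[(b, k)])
                idx = self.rng.choices(range(len(choices)), weights=probs)[0]
                block[k] = choices[idx]
            arch.append(block)
        return arch

    def update(self, arch, reward, lr=0.5):
        # REINFORCE-style update: push up the logits of the sampled choices
        # in proportion to the (validation) reward.
        for b, block in enumerate(arch):
            for k, v in block.items():
                idx = self.space[k].index(v)
                self.logits[(b, k)][idx] += lr * reward

searcher = ProbabilisticSearcher(SEARCH_SPACE)
arch = searcher.sample()
searcher.update(arch, reward=1.0)  # reward would come from proxy training
```

In a real search, the reward would be a validation metric from briefly training the sampled network; here it is a placeholder scalar.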
Related papers
- TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding [74.033589504806]
We propose an efficient multi-level convolution architecture for 3D visual grounding.
Our method achieves the top inference speed and surpasses the previous fastest method by 100% in FPS.
arXiv Detail & Related papers (2025-02-14T18:59:59Z)
- Flexible Channel Dimensions for Differentiable Architecture Search [50.33956216274694]
We propose a novel differentiable neural architecture search method with an efficient dynamic channel allocation algorithm.
We show that the proposed framework is able to find DNN architectures that are equivalent to previous methods in task accuracy and inference latency.
arXiv Detail & Related papers (2023-06-13T15:21:38Z)
- Searching a High-Performance Feature Extractor for Text Recognition Network [92.12492627169108]
We design a domain-specific search space by exploring principles for having good feature extractors.
As the space is huge and complexly structured, no existing NAS algorithms can be applied.
We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z)
- PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [26.059213743430192]
We study 3D deep learning from the efficiency perspective.
We propose a novel hardware-efficient 3D primitive, Point-Voxel Convolution (PVConv)
arXiv Detail & Related papers (2022-04-25T17:13:55Z)
- Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search [84.4140192638394]
Most differentiable neural architecture search methods construct a super-net for search and derive a target-net as its sub-graph for evaluation.
In this paper, we introduce EnTranNAS that is composed of Engine-cells and Transit-cells.
Our method also spares much memory and computation cost, which speeds up the search process.
arXiv Detail & Related papers (2021-01-27T12:16:47Z)
- Memory-Efficient Hierarchical Neural Architecture Search for Image Restoration [68.6505473346005]
We propose HiNAS, a memory-efficient hierarchical NAS framework for image denoising and image super-resolution tasks.
With a single GTX1080Ti GPU, it takes only about 1 hour to search for the denoising network on BSD500 and 3.5 hours to search for the super-resolution structure on DIV2K.
arXiv Detail & Related papers (2020-12-24T12:06:17Z)
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms the existing state of the art by a large margin on the DAVIS'16 Unsupervised, FBMS, and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution [34.713667358316286]
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely.
Existing 3D perception models are not able to recognize small instances very well due to the low-resolution voxelization and aggressive downsampling.
We propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch.
arXiv Detail & Related papers (2020-07-31T14:27:27Z)
- X3D: Expanding Architectures for Efficient Video Recognition [21.539880641349693]
X3D is a family of efficient video networks that progressively expand a tiny 2D image classification architecture.
Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed.
We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks.
arXiv Detail & Related papers (2020-04-09T17:59:47Z)
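The X3D entry above describes a stepwise expansion procedure: starting from a tiny network, one axis is expanded at a time and the best trade-off is kept. The sketch below illustrates that greedy loop only; the axis names, the toy FLOPs model, and the diminishing-returns accuracy proxy are stand-ins I invented for illustration, not the paper's actual cost model or metric.

```python
import math

# Illustrative expansion axes in the spirit of X3D (temporal length,
# spatial resolution, width, depth); names and values are assumptions.
AXES = ("frames", "resolution", "width", "depth")

def flops(cfg):
    # Toy cost model: compute grows with every axis, quadratically in
    # spatial resolution.
    return cfg["frames"] * cfg["resolution"] ** 2 * cfg["width"] * cfg["depth"]

def accuracy_proxy(cfg):
    # Toy diminishing-returns proxy standing in for validation accuracy.
    return sum(math.log(cfg[a]) for a in AXES)

def expand(cfg, axis, factor=2):
    out = dict(cfg)
    out[axis] *= factor
    return out

def stepwise_expand(base, budget):
    """Greedily expand one axis at a time, keeping the candidate with the
    best accuracy proxy that still fits the FLOPs budget."""
    cfg = dict(base)
    while True:
        candidates = [expand(cfg, a) for a in AXES]
        feasible = [c for c in candidates if flops(c) <= budget]
        if not feasible:
            return cfg  # no further expansion fits the budget
        cfg = max(feasible, key=accuracy_proxy)

base = {"frames": 1, "resolution": 32, "width": 8, "depth": 4}
final = stepwise_expand(base, budget=flops(base) * 64)
```

Each iteration multiplies the cost by at least the expansion factor, so the loop terminates once no single-axis expansion fits the budget.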
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.