AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language
Recognition
- URL: http://arxiv.org/abs/2308.08327v1
- Date: Wed, 16 Aug 2023 12:40:47 GMT
- Title: AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language
Recognition
- Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng
- Abstract summary: We propose a novel model (AdaBrowse) to dynamically select the most informative subsequence from input video sequences.
AdaBrowse achieves accuracy comparable to state-of-the-art methods with 1.44$\times$ higher throughput and 2.12$\times$ fewer FLOPs.
- Score: 39.778958624066185
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Raw videos have been shown to contain considerable feature redundancy: in
many cases, only a portion of the frames is already sufficient for accurate
recognition. In this paper, we investigate whether such redundancy can be
effectively leveraged to enable efficient inference in continuous sign language
recognition (CSLR). We propose a novel adaptive model (AdaBrowse) that
dynamically selects the most informative subsequence from an input video
sequence by modelling this problem as a sequential decision task. Specifically,
we first use a lightweight network to quickly scan the input video and extract
coarse features. These features are then fed into a policy network that selects
a subsequence to process. The selected subsequence is finally passed to a
normal CSLR model for sentence prediction. As only a portion of the frames is
processed in this procedure, total computation can be considerably reduced.
Besides temporal redundancy, we also investigate whether the inherent spatial
redundancy can be seamlessly exploited for further efficiency, i.e., by
dynamically selecting the lowest feasible input resolution for each sample; the
resulting model is referred to as AdaBrowse+. Extensive experimental results on
four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and
CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+, which achieve
accuracy comparable to state-of-the-art methods with 1.44$\times$ higher
throughput and 2.12$\times$ fewer FLOPs. Comparisons with other commonly-used
2D CNNs and adaptive efficiency methods further verify the effectiveness of
AdaBrowse. Code is available at
\url{https://github.com/hulianyuyy/AdaBrowse}.
Related papers
- Fast Deep Predictive Coding Networks for Videos Feature Extraction without Labels [2.554431612189437]
Deep predictive coding networks (DPCNs) capture video features through a bi-directional information flow.
This paper proposes a DPCN with fast inference of internal model variables that achieves high sparsity and accurate feature clustering.
Experiments on the CIFAR-10, Super Mario Bros video game, and Coil-100 datasets validate the approach.
arXiv Detail & Related papers (2024-09-08T01:53:25Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
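The even distribution of features across clusters in SIGMA rests on Sinkhorn-Knopp normalization, which alternately rescales rows and columns of an assignment matrix toward uniform marginals. The sketch below illustrates that general iteration on a random score matrix; it is not SIGMA's implementation, and the matrix sizes, `eps`, and iteration count are illustrative assumptions.

```python
import numpy as np

def sinkhorn(scores, n_iters=50, eps=0.05):
    """Sinkhorn-Knopp normalization of a (features x clusters) score matrix.
    Alternates row and column rescaling so that the returned soft assignment
    has uniform cluster marginals, i.e. features spread evenly over clusters."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # each feature gets total mass 1/n
        Q /= n
        Q /= Q.sum(axis=0, keepdims=True)  # each cluster gets total mass 1/k
        Q /= k
    return Q * n  # rescale so each row is a per-feature distribution

rng = np.random.default_rng(0)
scores = rng.normal(size=(128, 8))  # 128 space-time tube features, 8 clusters
Q = sinkhorn(scores)
print(Q.shape, Q.sum(axis=0))       # each cluster receives mass 128/8 = 16
```

The uniform column marginals are what prevent the degenerate solution where every feature collapses into a single cluster.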
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
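One common way to turn a frozen dense embedding into a sparse lexical vector is to project it onto a vocabulary-sized space and keep only the strongest non-negative term weights. The sketch below is a hypothetical illustration of that general idea, not the paper's method; `vocab_proj` and `top_k` are made-up stand-ins for a learned projection and a sparsity budget.

```python
import numpy as np

def to_sparse_lexical(dense, vocab_proj, top_k=32):
    """Map a dense embedding to a sparse lexical vector: project onto the
    vocabulary space, clip to non-negative weights, and zero out everything
    except the top_k strongest terms."""
    weights = np.maximum(dense @ vocab_proj, 0.0)  # non-negative term weights
    if top_k < weights.size:
        cutoff = np.partition(weights, -top_k)[-top_k]
        weights[weights < cutoff] = 0.0            # keep only the top_k terms
    return weights

rng = np.random.default_rng(0)
dense = rng.normal(size=256)                 # frozen dense-model embedding
vocab_proj = rng.normal(size=(256, 30522))   # hypothetical vocab projection
sparse = to_sparse_lexical(dense, vocab_proj)
print(int((sparse > 0).sum()))               # at most 32 nonzero terms
```

The resulting vector is mostly zeros, so it can be indexed and scored with a classical inverted index, which is the efficiency appeal of LSR.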
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackle this challenging task by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- Improved CNN-based Learning of Interpolation Filters for Low-Complexity Inter Prediction in Video Coding [5.46121027847413]
This paper introduces a novel explainable neural network-based inter-prediction scheme.
A novel training framework enables each network branch to resemble a specific fractional shift.
When implemented in the context of the Versatile Video Coding (VVC) test model, 0.77%, 1.27% and 2.25% BD-rate savings can be achieved.
arXiv Detail & Related papers (2021-06-16T16:48:01Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, whose features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnetworks in CNNs can be converted to temporal bilinear modules by adding an auxiliary sampling branch.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution [14.226301825772174]
We introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP)
It is a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information.
We achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card.
arXiv Detail & Related papers (2020-03-09T03:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.