E^2VTS: Energy-Efficient Video Text Spotting from Unmanned Aerial
Vehicles
- URL: http://arxiv.org/abs/2206.02281v1
- Date: Sun, 5 Jun 2022 22:43:17 GMT
- Title: E^2VTS: Energy-Efficient Video Text Spotting from Unmanned Aerial
Vehicles
- Authors: Zhenyu Hu, Zhenyu Wu, Pengcheng Pi, Yunhe Xue, Jiayi Shen, Jianchao
Tan, Xiangru Lian, Zhangyang Wang, and Ji Liu
- Abstract summary: Unmanned Aerial Vehicle (UAV)-based video text spotting has been extensively used in civil and military domains.
Our proposed energy-efficient video text spotting solution, dubbed E^2VTS, outperforms all previous methods by achieving a competitive tradeoff between energy efficiency and performance.
- Score: 78.50123964690094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unmanned Aerial Vehicle (UAV)-based video text spotting has been
extensively used in civil and military domains. A UAV's limited battery capacity
motivates us to develop an energy-efficient video text spotting solution. In
this paper, we first revisit RCNN's crop & resize training strategy and
empirically find that it outperforms aligned RoI sampling on a real-world video
text dataset captured by UAV. To reduce energy consumption, we further propose
a multi-stage image processor that takes videos' redundancy, continuity, and
mixed degradation into account. Lastly, the model is pruned and quantized
before being deployed on a Raspberry Pi. Our proposed energy-efficient video text
spotting solution, dubbed E^2VTS, outperforms all previous methods by
achieving a competitive tradeoff between energy efficiency and performance. All
our code and pre-trained models are available at
https://github.com/wuzhenyusjtu/LPCVC20-VideoTextSpotting.
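The abstract only names the compression steps; as a rough illustration of what pruning and quantizing a model before deployment on a Raspberry Pi can look like in PyTorch (the MobileNetV2 stand-in, sparsity level, and quantized layer types below are assumptions, not the released E^2VTS pipeline), consider:

```python
# Rough sketch only: magnitude pruning + post-training dynamic quantization before
# pushing a model to a Raspberry Pi. The MobileNetV2 stand-in, 30% sparsity, and
# int8 linear quantization are assumptions, not the authors' released E^2VTS settings.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import mobilenet_v2

model = mobilenet_v2()  # stand-in for the text-spotting backbone

# 1) Unstructured L1 (magnitude) pruning of conv and linear weights.
for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# 2) Post-training dynamic quantization: linear layers run in int8 at inference.
model.eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 3) Serialize the compressed weights for the edge target.
torch.save(quantized.state_dict(), "e2vts_edge_model.pth")
```

The repository linked above is the reference for the authors' actual pruning and quantization configuration.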
Related papers
- Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring [4.303063757163241]
This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding.
Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques.
Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3%.
arXiv Detail & Related papers (2024-08-31T17:26:53Z)
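As a hedged sketch of the frame-selection idea described in the FLAME summary above (the tiny policy network, input resolution, and keep threshold are illustrative assumptions, not the paper's design):

```python
# Hypothetical sketch of policy-based frame skipping, not the paper's actual model:
# a tiny policy network scores each downsampled frame, and low-scoring (redundant)
# frames are dropped before any heavy downstream model runs.
import torch
import torch.nn as nn

class FramePolicy(nn.Module):
    """Assigns a 'keep' probability to each low-resolution frame (assumed design)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, frames):                            # frames: (T, 3, H, W)
        feats = self.features(frames).flatten(1)          # (T, 16)
        return torch.sigmoid(self.head(feats)).squeeze(-1)  # (T,) keep scores

def select_frames(frames, policy, threshold=0.5):
    """Keep only frames the policy scores above the (assumed) threshold."""
    with torch.no_grad():
        scores = policy(frames)
    keep = scores > threshold
    return frames[keep], keep

video = torch.rand(32, 3, 112, 112)        # 32 downsampled frames
kept, mask = select_frames(video, FramePolicy())
print(f"kept {kept.shape[0]} of {video.shape[0]} frames")
```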
- BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation [22.870994478494566]
We introduce BEVCar, a novel approach for joint BEV object and map segmentation.
The core novelty of our approach lies in first learning a point-based encoding of raw radar data.
We show that incorporating radar information significantly enhances robustness in challenging environmental conditions.
arXiv Detail & Related papers (2024-03-18T13:14:46Z)
- Compressing Vision Transformers for Low-Resource Visual Learning [7.662469543657508]
Vision transformer (ViT) and its variants offer state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation.
These models are large and computation-heavy, making their deployment on mobile and edge scenarios limited.
We aim to take a step toward bringing vision transformers to the edge by utilizing popular model compression techniques such as distillation, pruning, and quantization.
arXiv Detail & Related papers (2023-09-05T23:33:39Z)
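For the compression entry above, a generic logit-distillation objective (temperature and loss weighting are assumed values, not necessarily those used in the paper) can be sketched as:

```python
# Generic logit-distillation loss (assumed temperature/weighting, not the paper's exact recipe):
# the student matches the teacher's softened class distribution plus the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """KL term on temperature-softened logits + cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random tensors standing in for a ViT teacher and a small student.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```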
arXiv Detail & Related papers (2023-09-05T23:33:39Z) - TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tuning-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Learning to Compress Unmanned Aerial Vehicle (UAV) Captured Video: Benchmark and Analysis [54.07535860237662]
We propose a novel task for learned UAV video coding and construct a comprehensive and systematic benchmark for such a task.
It is expected that the benchmark will accelerate the research and development in video coding on drone platforms.
arXiv Detail & Related papers (2023-01-15T15:18:02Z)
- EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z)
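A minimal sketch of the head-motion-conditioned feature distillation that the EgoDistill summary describes (module sizes, the IMU input shape, and the MSE objective are assumptions; the paper's architecture may differ):

```python
# Hypothetical sketch: a lightweight student sees one frame plus head motion and
# regresses the feature a heavy video model would produce for the full clip.
# All module sizes and the IMU dimensionality are assumptions, not EgoDistill's design.
import torch
import torch.nn as nn

class LightStudent(nn.Module):
    def __init__(self, feat_dim=512, imu_dim=6):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.motion_enc = nn.Sequential(nn.Linear(imu_dim, feat_dim), nn.ReLU())
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, frame, head_motion):
        z = torch.cat([self.image_enc(frame), self.motion_enc(head_motion)], dim=-1)
        return self.fuse(z)

student = LightStudent()
frame = torch.rand(4, 3, 112, 112)   # one frame per clip
head_motion = torch.rand(4, 6)       # per-clip IMU summary (assumed shape)
teacher_feat = torch.rand(4, 512)    # feature from a frozen heavy video model
loss = nn.functional.mse_loss(student(frame, head_motion), teacher_feat)
loss.backward()
```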
- TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos [57.92385818430939]
Drone-to-drone detection using visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones.
Existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices.
We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency.
arXiv Detail & Related papers (2022-10-16T03:05:13Z)
- A Simple Baseline for BEV Perception Without LiDAR [37.00868568802673]
Building 3D perception systems for autonomous vehicles that do not rely on LiDAR is a critical research problem.
Current methods use multi-view RGB data collected from cameras around the vehicle.
We propose a simple baseline model, where the "lifting" step simply averages features from all projected image locations.
arXiv Detail & Related papers (2022-06-16T06:57:32Z)
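The "lift by averaging" step mentioned in the Simple-BEV summary above can be sketched as follows (the camera count, BEV grid size, and the precomputed projection inputs are assumptions, not the paper's exact implementation):

```python
# Rough sketch of lifting image features to a BEV grid by averaging: each BEV cell
# takes the mean of the image features sampled wherever it projects validly.
# Camera count, grid size, and the projection inputs are assumptions.
import torch
import torch.nn.functional as F

def lift_by_averaging(img_feats, norm_uv, valid):
    """
    img_feats: (N_cam, C, Hf, Wf) per-camera feature maps
    norm_uv:   (N_cam, X, Y, 2)   projected BEV-cell centers in [-1, 1] image coords
    valid:     (N_cam, X, Y)      True where the projection falls inside the image
    returns    (C, X, Y)          BEV features averaged over valid cameras
    """
    sampled = F.grid_sample(img_feats, norm_uv, align_corners=False)  # (N_cam, C, X, Y)
    valid = valid.unsqueeze(1).float()                                # (N_cam, 1, X, Y)
    summed = (sampled * valid).sum(dim=0)                             # (C, X, Y)
    count = valid.sum(dim=0).clamp(min=1.0)                           # (1, X, Y)
    return summed / count

# Toy usage with random stand-ins for 6 cameras and a 200x200 BEV grid.
feats = torch.rand(6, 64, 28, 60)
uv = torch.rand(6, 200, 200, 2) * 2 - 1   # pretend projection results
mask = torch.rand(6, 200, 200) > 0.3
bev = lift_by_averaging(feats, uv, mask)  # (64, 200, 200)
```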
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.