Leveraging Motion Estimation for Efficient Bayer-Domain Computer Vision
- URL: http://arxiv.org/abs/2501.15119v2
- Date: Thu, 14 Aug 2025 02:43:14 GMT
- Title: Leveraging Motion Estimation for Efficient Bayer-Domain Computer Vision
- Authors: Haichao Wang, Xinyue Xi, Jiangtao Wen, Yuxing Han
- Abstract summary: The existing computer vision processing pipeline acquires visual information using an image sensor that captures pixel information in the Bayer pattern. The raw sensor data are then processed using an image signal processor (ISP) that first converts Bayer pixel data to RGB on a pixel by pixel basis, followed by video convolutional network (VCN) processing on a frame by frame basis. We propose a novel framework that eliminates the ISP and leverages motion estimation to accelerate video vision tasks directly in the Bayer domain.
- Score: 12.940116042097847
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The existing computer vision processing pipeline acquires visual information using an image sensor that captures pixel information in the Bayer pattern. The raw sensor data are then processed using an image signal processor (ISP) that first converts Bayer pixel data to RGB on a pixel by pixel basis, followed by video convolutional network (VCN) processing on a frame by frame basis. Both ISP and VCN are computationally expensive with high power consumption and latency. In this paper, we propose a novel framework that eliminates the ISP and leverages motion estimation to accelerate video vision tasks directly in the Bayer domain. We introduce Motion Estimation-based Video Convolution (MEVC), which integrates sliding-window motion estimation into each convolutional layer, enabling prediction and residual-based refinement that reduces redundant computations across frames. This design bridges the structural gap between block-based motion estimation and spatial convolution, enabling accurate, low-cost processing. Our end-to-end pipeline supports raw Bayer input and achieves over 70% reduction in FLOPs with minimal accuracy degradation across video semantic segmentation, depth estimation, and object detection benchmarks, using both synthetic Bayer-converted and real Bayer video datasets. This framework generalizes across convolution-based models and marks the first effective reuse of motion estimation for accelerating video computer vision directly from raw sensor data.
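The abstract mentions block-based motion estimation followed by residual-based refinement. As a rough illustration of that general idea only, the sketch below runs classic exhaustive block matching with a sum-of-absolute-differences (SAD) cost on a single-channel frame and then computes the residual for the matched block. It is not MEVC and does not reproduce the paper's per-layer design; every function name, the synthetic frame, and the choice to search only even offsets (so that compared samples share the same colour filter in a 2x2-periodic Bayer mosaic) are assumptions made for this example.

```python
def make_frame(h, w):
    """Synthetic single-channel frame standing in for raw Bayer samples."""
    return [[(31 * y + 17 * x) % 251 for x in range(w)] for y in range(h)]


def shift_right(frame, dx):
    """Shift frame content right by dx pixels, padding the left edge with zeros."""
    return [[row[x - dx] if x >= dx else 0 for x in range(len(row))] for row in frame]


def sad(prev, curr, py, px, cy, cx, block):
    """Sum of absolute differences between two block x block patches."""
    return sum(
        abs(prev[py + i][px + j] - curr[cy + i][cx + j])
        for i in range(block)
        for j in range(block)
    )


def block_match(prev, curr, cy, cx, block=4, search=4, step=2):
    """Return ((dy, dx), cost) for the patch of prev that best predicts
    the block of curr whose top-left corner is (cy, cx).

    step=2 keeps candidates on even offsets so that compared samples sit
    on the same Bayer colour phase (an assumption of this sketch).
    """
    h, w = len(prev), len(prev[0])
    best, best_cost = (0, 0), sad(prev, curr, cy, cx, cy, cx, block)
    for dy in range(-search, search + 1, step):
        for dx in range(-search, search + 1, step):
            py, px = cy + dy, cx + dx
            if py < 0 or px < 0 or py + block > h or px + block > w:
                continue  # candidate patch falls outside the frame
            cost = sad(prev, curr, py, px, cy, cx, block)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost


def residual_block(prev, curr, cy, cx, mv, block=4):
    """Difference between the current block and its motion-compensated prediction."""
    dy, dx = mv
    return [
        [curr[cy + i][cx + j] - prev[cy + dy + i][cx + dx + j] for j in range(block)]
        for i in range(block)
    ]


prev = make_frame(12, 12)
curr = shift_right(prev, 2)          # scene moves two pixels to the right
mv, cost = block_match(prev, curr, cy=4, cx=6)
res = residual_block(prev, curr, 4, 6, mv)
print(mv, cost)                      # (0, -2) 0
```

Here the block at row 4, column 6 of the current frame is predicted exactly from the previous frame two pixels to the left, so the residual is all zeros and nothing would need to be recomputed for that block. With real video the residual is usually nonzero, and only that correction has to be processed.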
Related papers
- BiVM: Accurate Binarized Neural Network for Efficient Video Matting [56.000594826508504]
Deep neural networks for real-time video matting suffer significant computational limitations on edge devices. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin.
arXiv Detail & Related papers (2025-07-06T16:32:37Z) - Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling [7.3949576464066]
We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints. We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement.
arXiv Detail & Related papers (2025-04-07T22:21:54Z) - Efficient Visual State Space Model for Image Deblurring [99.54894198086852]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. We propose a simple yet effective visual state space model (EVSSM) for image deblurring. The proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction [4.60378493357739]
We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications.
For real-time applications, our results show the effectiveness of our proposed architecture by enabling up to 2x additional bandwidth reduction.
arXiv Detail & Related papers (2024-03-17T20:36:43Z) - Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z) - Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z) - EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities compared to standard action recognition in RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z) - VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z) - RN-Net: Reservoir Nodes-Enabled Neuromorphic Vision Sensing Network [7.112892720740359]
Event-based cameras are inspired by spiking and asynchronous spike representation of the biological visual system.
We propose a neural network architecture, based on simple convolution layers integrated with dynamic temporal encoding for local and global reservoirs.
RN-Net achieves the highest accuracy reported to date, 99.2%, for DV128 Gesture, and one of the highest accuracies, 67.5%, for the DVS Lip dataset at a much smaller network size.
arXiv Detail & Related papers (2023-03-19T21:20:45Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Signal Processing for Implicit Neural Representations [80.38097216996164]
Implicit Neural Representations (INRs) encode continuous multi-media data via multi-layer perceptrons.
Existing works manipulate such continuous representations via processing on their discretized instance.
We propose an implicit neural signal processing network, dubbed INSP-Net, via differential operators on INR.
arXiv Detail & Related papers (2022-10-17T06:29:07Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality as $34.07 \rightarrow 34.57$ (measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z) - Enabling ISP-less Low-Power Computer Vision [4.102254385058941]
We release the raw version of a large-scale benchmark for generic high-level vision tasks.
For ISP-less CV systems, training on raw images results in a 7.1% increase in test accuracy.
We propose an energy-efficient form of analog in-pixel demosaicing that may be coupled with in-pixel CNN computations.
arXiv Detail & Related papers (2022-10-11T13:47:30Z) - Efficient Video Deblurring Guided by Motion Magnitude [37.25713728458234]
We propose a novel framework that utilizes the motion magnitude prior (MMP) as guidance for efficient deep video deblurring.
The MMP consists of both spatial and temporal blur level information, which can be further integrated into an efficient recurrent neural network (RNN) for video deblurring.
arXiv Detail & Related papers (2022-07-27T08:57:48Z) - RF-Photonic Deep Learning Processor with Shannon-Limited Data Movement [0.0]
Optical neural networks (ONNs) are promising accelerators with ultra-low latency and energy consumption.
We introduce our multiplicative analog frequency transform ONN (MAFT-ONN) that encodes the data in the frequency domain.
We experimentally demonstrate the first hardware accelerator that computes fully-analog deep learning on raw RF signals.
arXiv Detail & Related papers (2022-07-08T16:37:13Z) - P2M: A Processing-in-Pixel-in-Memory Paradigm for Resource-Constrained TinyML Applications [4.102356304183255]
High-resolution input images still need to be streamed between the camera and the AI processing unit, frame by frame, causing energy, bandwidth, and security bottlenecks.
We propose a novel Processing-in-Pixel-in-memory (P2M) paradigm, that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution and ReLU.
Our results indicate that P2M reduces data transfer bandwidth from sensors and analog to digital conversions by 21x, and the energy-delay product (EDP) incurred in processing a MobileNetV2 model on a TinyML
arXiv Detail & Related papers (2022-03-07T04:15:29Z) - Highly-Efficient Binary Neural Networks for Visual Place Recognition [24.674034243725455]
VPR is a fundamental task for autonomous navigation as it enables a robot to localize itself in the workspace when a known location is detected.
CNN-based techniques achieve state-of-the-art VPR performance but are computationally intensive and energy demanding.
This paper presents a class of BNNs for VPR that combines depthwise separable factorization and binarization to replace the first convolutional layer.
arXiv Detail & Related papers (2022-02-24T22:05:11Z) - Neural Residual Flow Fields for Efficient Video Representations [5.904082461511478]
Implicit neural representation (INR) has emerged as a powerful paradigm for representing signals, such as images, videos, 3D shapes, etc.
We propose a novel INR approach to representing and compressing videos by explicitly removing data redundancy.
We show that the proposed method outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-01-12T06:22:09Z) - Hybrid SNN-ANN: Energy-Efficient Classification and Object Detection for Event-Based Vision [64.71260357476602]
Event-based vision sensors encode local pixel-wise brightness changes in streams of events rather than image frames.
Recent progress in object recognition from event-based sensors has come from conversions of deep neural networks.
We propose a hybrid architecture for end-to-end training of deep neural networks for event-based pattern recognition and object detection.
arXiv Detail & Related papers (2021-12-06T23:45:58Z) - VideoPose: Estimating 6D object pose from videos [14.210010379733017]
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art algorithms.
arXiv Detail & Related papers (2021-11-20T20:57:45Z) - Dynamic Gesture Recognition [0.0]
It is possible to use machine learning to classify images and/or videos instead of the traditional computer vision algorithms.
The aim of this project is to build a symbiosis between a convolutional neural network (CNN) and a recurrent neural network (RNN).
arXiv Detail & Related papers (2021-09-20T09:45:29Z) - Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming [61.145467627057194]
We develop a new method called Face Pixelation in Video Live Streaming to generate automatic personal privacy filtering.
For fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages.
On the video live streaming dataset we collected, FPVLS obtains satisfying accuracy, real-time efficiency, and contains the over-pixelation problems.
arXiv Detail & Related papers (2021-01-04T16:18:26Z) - CNNs for JPEGs: A Study in Computational Cost [49.97673761305336]
Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade.
CNNs are capable of learning robust representations of the data directly from the RGB pixels.
Deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years.
arXiv Detail & Related papers (2020-12-26T15:00:10Z) - Computational optimization of convolutional neural networks using separated filters architecture [69.73393478582027]
We consider a convolutional neural network transformation that reduces computational complexity and thus speeds up neural network processing.
Use of convolutional neural networks (CNN) is the standard approach to image recognition despite the fact they can be too computationally demanding.
arXiv Detail & Related papers (2020-02-18T17:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.