Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention
- URL: http://arxiv.org/abs/2508.06058v1
- Date: Fri, 08 Aug 2025 06:40:59 GMT
- Title: Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention
- Authors: Shiyang Zhou, Haijin Zeng, Yunfan Lu, Yongyong Chen, Jie Liu, Jingyong Su,
- Abstract summary: Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous "events" instead of frames, offering advanced application on mobile photography.<n> challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts on the demosaicing process before downstream application.<n>We introduce textbfTSANet, a lightweight textbfTwo-stage network via textbfState space augmented cross-textbfAttention, which can handle event pixels in
- Score: 14.330556564038401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous "events" instead of frames, offering advanced application on mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts on the demosaicing process before downstream application. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce \textbf{TSANet}, a lightweight \textbf{T}wo-stage network via \textbf{S}tate space augmented cross-\textbf{A}ttention, which can handle event pixels inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes positional prior for demosaicing and enhances global dependencies through the state space model with linear complexity. In summary, TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than the previous state-of-the-art method DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while respectively reducing parameter and computation costs by $1.86\times$ and $3.29\times$. Our approach presents new possibilities for efficient image demosaicing on mobile devices. Code is available in the supplementary materials.
Related papers
- MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation [17.515348703686232]
We propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams.<n>MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.
arXiv Detail & Related papers (2025-12-30T14:09:17Z) - Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios [23.41380544271609]
Dynamic range of conventional RGB cameras reduces global contrast and causes loss of high-frequency details.<n>We propose a motion cue fusion network (MCFNet) which achieves optimal cross-modal feature fusion under challenging lighting.<n>MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios.
arXiv Detail & Related papers (2025-08-14T14:48:21Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.<n>We propose Self-supervised Transfer (PST) and FrequencyDe-coupled Fusion module (FreDF)<n>PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.<n>FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [95.98801201266099]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps.<n>We propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR.<n>Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.<n>We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.<n>Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera [70.28702677370879]
Hybrid Event-Based Vision Sensor (HybridEVS) is a novel sensor integrating traditional frame-based and event-based sensors.
Despite its potential, the lack of Image signal processing (ISP) pipeline specifically designed for HybridEVS poses a significant challenge.
We propose a coarse-to-fine framework named DemosaicFormer which comprises coarse demosaicing and pixel correction.
arXiv Detail & Related papers (2024-06-12T07:20:46Z) - Homography Guided Temporal Fusion for Road Line and Marking Segmentation [73.47092021519245]
Road lines and markings are frequently occluded in the presence of moving vehicles, shadow, and glare.
We propose a Homography Guided Fusion (HomoFusion) module to exploit temporally-adjacent video frames for complementary cues.
We show that exploiting available camera intrinsic data and ground plane assumption for cross-frame correspondence can lead to a light-weight network with significantly improved performances in speed and accuracy.
arXiv Detail & Related papers (2024-04-11T10:26:40Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages.
Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation.
ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z) - Learning Parallax for Stereo Event-based Motion Deblurring [8.201943408103995]
Existing approaches rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world.
We propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet)
We build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps.
arXiv Detail & Related papers (2023-09-18T06:51:41Z) - HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously
Exploiting Image and Event Modalities [6.543272301133159]
Event cameras detect changes in per-pixel intensity to generate asynchronous event streams.
They offer great potential for accurate semantic map retrieval in real-time autonomous systems.
Existing implementations for event segmentation suffer from sub-based performance.
We propose hybrid end-to-end learning framework HALSIE to reduce inference cost by up to $20times$ versus art.
arXiv Detail & Related papers (2022-11-19T17:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.