PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
- URL: http://arxiv.org/abs/2506.18807v2
- Date: Tue, 24 Jun 2025 09:56:22 GMT
- Title: PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
- Authors: Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno
- Abstract summary: PicoSAM2 is a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22 MB) runs in 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net and uses knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22 MB) runs in 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
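To make the architectural description concrete, here is a minimal PyTorch sketch of the two ingredients the abstract names: a depthwise separable convolution block and a point prompt fed to the network as an extra input channel. The layer sizes, activations, and the rasterized prompt encoding are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv (MobileNet-style)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

def encode_point_prompt(xy, size):
    """Rasterize a normalized (x, y) click into a single-channel map; an
    illustrative stand-in for the paper's fixed-point prompt encoding."""
    h, w = size
    prompt = torch.zeros(1, 1, h, w)
    prompt[0, 0, int(xy[1] * h), int(xy[0] * w)] = 1.0
    return prompt

# Toy promptable U-Net stage: image + prompt map in, coarse mask logits out.
image = torch.randn(1, 3, 256, 256)
prompt = encode_point_prompt((0.5, 0.5), (256, 256))
encoder = nn.Sequential(DepthwiseSeparableConv(4, 16), nn.MaxPool2d(2),
                        DepthwiseSeparableConv(16, 32))
decoder = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                        DepthwiseSeparableConv(32, 16), nn.Conv2d(16, 1, 1))
mask_logits = decoder(encoder(torch.cat([image, prompt], dim=1)))
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```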
Related papers
- Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.45368843861917]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
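The summary only names the mechanism, so the following is a generic gating sketch in which the current hidden state gates a memory readout shared from an earlier layer; the actual GMU formulation may differ.

```python
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Illustrative gating of a shared memory readout by the current hidden
    state; dimensions and the single linear gate are assumptions."""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, hidden, shared_memory):
        # hidden, shared_memory: (batch, seq, d_model)
        return torch.sigmoid(self.gate(hidden)) * shared_memory

gmu = GatedMemoryUnit(64)
print(gmu(torch.randn(2, 10, 64), torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```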
arXiv Detail & Related papers (2025-07-09T07:27:00Z)
- EdgeTAM: On-Device Track Anything Model [65.10032957471824]
The Segment Anything Model 2 (SAM 2) further extends its capability from image to video inputs through a memory bank mechanism. We aim to make SAM 2 much more efficient so that it runs even on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
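A perceiver-style module of the kind the summary mentions can be sketched as a fixed set of learnable queries cross-attending to a dense 2D feature map; the dimensions and single-layer structure here are assumptions, not EdgeTAM's exact design.

```python
import torch
import torch.nn as nn

class SpatialPerceiver(nn.Module):
    """Learnable queries cross-attend to a dense 2D feature map, compressing
    it into a small set of tokens."""
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(queries, tokens, tokens)
        return compressed                         # (B, num_latents, C)

perceiver = SpatialPerceiver()
print(perceiver(torch.randn(2, 256, 32, 32)).shape)  # torch.Size([2, 64, 256])
```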
arXiv Detail & Related papers (2025-01-13T12:11:07Z)
- EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set up the new frontier of 5M-magnitude lightweight models on various downstream tasks. Our work rethinks the lightweight infrastructure of the efficient inverted residual block (IRB) and practical components in the Transformer. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, we investigate the performance upper limit of lightweight models at the 5M magnitude.
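For reference, the inverted residual block that the summary says EMOv2 rethinks follows the classic expand / depthwise / project pattern; the sketch below shows that baseline structure, not EMOv2's extended block.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style IRB: 1x1 expansion, 3x3 depthwise conv, 1x1 projection,
    wrapped in a residual connection."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, dim, 1, bias=False), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)

print(InvertedResidual(32)(torch.randn(1, 32, 56, 56)).shape)  # (1, 32, 56, 56)
```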
arXiv Detail & Related papers (2024-12-09T17:12:22Z)
- SAVE: Segment Audio-Visual Easy way using Segment Anything Model [0.0]
This study presents a lightweight approach, SAVE, which efficiently adapts the pre-trained Segment Anything Model (SAM) to the audio-visual segmentation (AVS) task.
Our proposed model achieves effective audio-visual fusion and interaction during the encoding stage.
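The summary does not detail the fusion design, so the following is a generic audio-visual fusion sketch in which visual tokens cross-attend to audio tokens with a residual connection; token counts and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualAdapter(nn.Module):
    """Generic fusion adapter: visual tokens cross-attend to audio tokens and
    the result is added back residually."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, audio_tokens):
        fused, _ = self.attn(visual_tokens, audio_tokens, audio_tokens)
        return self.norm(visual_tokens + fused)

adapter = AudioVisualAdapter()
out = adapter(torch.randn(2, 196, 256), torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```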
arXiv Detail & Related papers (2024-07-02T07:22:28Z)
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38× higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
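A parallel mixer of the kind described can be sketched as a depthwise convolution branch and a self-attention branch applied to the same features and summed; the branch details and the residual combination below are illustrative, not ParFormer's exact design.

```python
import torch
import torch.nn as nn

class ParallelMixer(nn.Module):
    """Local (depthwise conv) and global (self-attention) mixing run in
    parallel on the same feature map and are summed."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob

print(ParallelMixer()(torch.randn(1, 64, 14, 14)).shape)  # (1, 64, 14, 14)
```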
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- Q-Segment: Segmenting Images In-Sensor for Vessel-Based Medical Diagnosis [13.018482089796159]
We present "Q-Segment", a quantized real-time segmentation algorithm, and conduct a comprehensive evaluation on a low-power edge vision platform with the Sony IMX500.
Q-Segment achieves an ultra-low in-sensor inference time of only 0.23 ms and a power consumption of only 72 mW.
This research contributes valuable insights into edge-based image segmentation, laying the foundation for efficient algorithms tailored to low-power environments.
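Quantized in-sensor models of this kind rely on integer arithmetic with a per-tensor scale and zero point. Below is a minimal sketch of an asymmetric int8 quantize/dequantize round trip, not Q-Segment's specific scheme.

```python
import torch

def quantize_int8(x):
    """Asymmetric per-tensor int8 quantization (scale + zero point)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - (x.min() / scale).item()))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(1, 16, 32, 32)
q, s, zp = quantize_int8(x)
print((dequantize(q, s, zp) - x).abs().max())  # worst-case error ~ scale / 2
```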
arXiv Detail & Related papers (2023-12-15T15:01:41Z)
- EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM [71.868623296582]
EdgeSAM is an accelerated variant of the Segment Anything Model (SAM).
Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture.
It is the first SAM variant that can run at over 30 FPS on an iPhone 14.
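The basic step of such a distillation, matching student CNN encoder features to teacher ViT encoder features on the same images, can be sketched as below; EdgeSAM's full prompt-in-the-loop procedure also involves the prompt encoder and mask decoder, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def encoder_distill_loss(student_feats, teacher_feats):
    """Pixel-wise feature distillation between student and teacher image
    encoders, with the student map resized to the teacher's resolution."""
    student_feats = F.interpolate(student_feats, size=teacher_feats.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return F.mse_loss(student_feats, teacher_feats)

loss = encoder_distill_loss(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 64, 64))
print(loss.item())
```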
arXiv Detail & Related papers (2023-12-11T18:59:52Z)
- TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation [11.917014372788584]
This work leverages one of the first "AI in sensor" vision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power end-to-end edge vision applications.
We propose TinyTracker, a highly efficient, fully quantized model for 2D gaze estimation designed to maximize the performance of the edge vision systems considered in this study.
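A fully quantized model of this kind is typically produced by post-training static quantization. The sketch below uses PyTorch's torch.ao.quantization on a toy gaze-regression network; the network and calibration data are stand-ins and do not reproduce TinyTracker's architecture or the IMX500 toolchain.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyGazeNet(nn.Module):
    """Toy stand-in for a fully quantized 2D gaze regressor."""
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = tq.QuantStub(), tq.DeQuantStub()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, 2)   # predicted (x, y) gaze point

    def forward(self, x):
        return self.dequant(self.head(self.backbone(self.quant(x))))

# Post-training static quantization: calibrate with sample data, then convert to int8.
model = TinyGazeNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")   # x86 backend; pick per target
prepared = tq.prepare(model)
prepared(torch.randn(8, 1, 96, 96))                # calibration pass
int8_model = tq.convert(prepared)
print(int8_model(torch.randn(1, 1, 96, 96)))
```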
arXiv Detail & Related papers (2023-07-15T14:34:25Z)
- You Only Segment Once: Towards Real-Time Panoptic Segmentation [68.91492389185744]
YOSO is a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps.
YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K.
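The mask prediction step described above amounts to a dynamic 1x1 convolution: each predicted kernel is dotted with the image feature map at every pixel. A minimal sketch, with the kernel count and channel width chosen arbitrarily:

```python
import torch

def dynamic_conv_masks(kernels, features):
    """Mask logits via dynamic 1x1 convolution between predicted kernels and
    the image feature map."""
    # kernels: (B, N, C) per-segment kernels; features: (B, C, H, W)
    return torch.einsum("bnc,bchw->bnhw", kernels, features)

masks = dynamic_conv_masks(torch.randn(2, 100, 256), torch.randn(2, 256, 64, 64))
print(masks.shape)  # torch.Size([2, 100, 64, 64])
```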
arXiv Detail & Related papers (2023-03-26T07:55:35Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
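The SDTA idea can be sketched as a channel split into groups, each mixed by a depthwise convolution, followed by attention computed across channels (a C x C attention map, so cost scales with channel count rather than spatial size); the cascading between groups and the projections of the real encoder are simplified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDTASketch(nn.Module):
    """Simplified split depth-wise transpose attention: per-group depthwise
    convs, then channel-wise (transposed) attention."""
    def __init__(self, dim=64, groups=4):
        super().__init__()
        self.group_dim = dim // groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(self.group_dim, self.group_dim, 3, padding=1, groups=self.group_dim)
            for _ in range(groups))

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        parts = torch.split(x, self.group_dim, dim=1)
        x = torch.cat([conv(p) for conv, p in zip(self.dwconvs, parts)], dim=1)
        tokens = F.normalize(x.flatten(2), dim=-1)            # (B, C, H*W)
        attn = (tokens @ tokens.transpose(1, 2)).softmax(-1)  # (B, C, C)
        return (attn @ x.flatten(2)).reshape(b, c, h, w)

print(SDTASketch()(torch.randn(1, 64, 28, 28)).shape)  # (1, 64, 28, 28)
```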
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices [13.62426382827205]
The PP-PicoDet family of real-time object detectors achieves superior object detection performance on mobile devices.
These models achieve better accuracy-latency trade-offs than other popular models.
arXiv Detail & Related papers (2021-11-01T12:53:17Z)
- YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs [14.85882314822983]
In order to map deep neural network (DNN) based object detection models to edge devices, one typically needs to compress such models significantly.
In this paper, we propose a novel edge GPU friendly module for multi-scale feature interaction.
We also propose a novel backbone adoption strategy, inspired by the changing translational information flow across various tasks.
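The summary does not specify the module, so the following is a generic multi-scale feature interaction sketch: features from several backbone stages are resized to a common resolution, concatenated, and fused with a cheap 1x1 convolution. Channel counts and the fusion choice are assumptions, not YOLO-ReT's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Generic multi-scale feature interaction: resize backbone features to
    the finest resolution, concatenate, and fuse with a 1x1 conv."""
    def __init__(self, in_channels=(32, 64, 128), out_channels=64):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, 1)

    def forward(self, feats):
        target = feats[0].shape[-2:]                 # fuse at the finest scale
        resized = [F.interpolate(f, size=target, mode="nearest") for f in feats]
        return self.fuse(torch.cat(resized, dim=1))

feats = [torch.randn(1, 32, 80, 80), torch.randn(1, 64, 40, 40), torch.randn(1, 128, 20, 20)]
print(MultiScaleFusion()(feats).shape)  # torch.Size([1, 64, 80, 80])
```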
arXiv Detail & Related papers (2021-10-26T14:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.