Fugu-MT paper translation (abstract): PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Paper overview: PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

arxiv url: http://arxiv.org/abs/2603.11917v1
Date: Thu, 12 Mar 2026 13:31:43 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.108746
Title: PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
Authors: Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno
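Abstract summary: We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region-of-interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines.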
Reference score (independently computed attention score): 22.190837932060607
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

Related papers