Fugu-MT Paper Translation (Abstract): IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

Paper overview: IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

arxiv url: http://arxiv.org/abs/2604.02032v1
Date: Thu, 02 Apr 2026 13:38:31 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.821166
Title: IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Title (reference translation): IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Authors: Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea
Abstract summary: IndoorCrowd is a dataset for indoor human detection, instance segmentation, and multi-object tracking. It comprises 31 videos with human-verified, per-instance segmentation masks.
Reference score (independently computed attention score): 39.799207552858114
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
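Abstract (reference translation): Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. IndoorCrowd is a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.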
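
The abstract names mask IoU and Cohen's κ among the metrics used to compare auto-annotators with human labels, but does not say how they are computed. The following is a minimal sketch of both metrics on a pair of binary masks, not the authors' evaluation code. It assumes that agreement is measured pixel by pixel between foreground and background (the paper may instead compute κ per instance or per frame). The function names `mask_iou` and `cohens_kappa` and the toy masks `human` and `auto` are hypothetical.

```python
from typing import Sequence

# A binary mask is given as rows of 0/1 values (assumed layout, for illustration).
Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def cohens_kappa(a: Mask, b: Mask) -> float:
    """Cohen's kappa for pixel-wise foreground/background agreement."""
    n = agree = a_pos = b_pos = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            pa, pb = bool(pa), bool(pb)
            n += 1
            agree += pa == pb
            a_pos += pa
            b_pos += pb
    if n == 0:
        raise ValueError("masks must be non-empty")
    p_o = agree / n  # observed agreement
    # Chance agreement from each annotator's foreground/background rates.
    p_e = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


# Toy 2x4 masks: a human label and a hypothetical auto-annotator output.
human = [[1, 1, 0, 0], [1, 1, 0, 0]]
auto = [[1, 1, 1, 0], [1, 0, 0, 0]]
print(mask_iou(human, auto), cohens_kappa(human, auto))
```

On these toy masks the two annotations share 3 of 5 foreground pixels and agree on 6 of 8 pixels overall, while chance agreement is 0.5, so κ comes out lower than raw agreement would suggest.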

Related papers