Fugu-MT Paper Translation (Summary): Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

Paper summary: Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

arxiv url: http://arxiv.org/abs/2603.08317v1
Date: Mon, 09 Mar 2026 12:38:56 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.989241
Title: Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
Title (reference translation): Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
Authors: Sadegh Rahmaniboldaji, Filip Rybansky, Quoc C. Vuong, Anya C. Hurlbert, Frank Guerin, Andrew Gilbert,
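Abstract summary: Humans consistently outperform state-of-the-art AI models in action recognition. We present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs). We show that human performance declines sharply in the transition from MIRCs to subMIRCs.
Reference score (independently computed attention score): 12.465670388296239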
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.
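Abstract (reference translation): Humans consistently outperform state-of-the-art AI models in action recognition, especially under real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the causes of this performance gap is essential for developing more robust, human-aligned models. This paper presents a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. The previously introduced Epic ReduAct is a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated with more than 3,000 human participants and the Side4Video model. The analysis combines quantitative metrics (Average Reduction Rate and Recognition Gap) with qualitative analyses of spatial factors (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). The results show that human performance drops sharply in the transition from MIRCs to subMIRCs, indicating a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even showing increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model is often insensitive to temporal disruption, revealing class-dependent temporal sensitivities.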

Related papers