Fugu-MT Paper Translation (Abstract): ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

Paper overview: ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

arxiv url: http://arxiv.org/abs/2604.07772v1
Date: Thu, 09 Apr 2026 03:51:14 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.681448
Title: ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
Authors: Zihao Liu, Xiaoyu Wu, Wenna Li, Jianqin Wu, Linlin Yang
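Abstract summary: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations. This paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner.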
Reference score (independently computed attention score): 27.912128185225054
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
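Illustrative sketch: the abstract mentions a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores, but it gives no details of how. The Python below is only a minimal illustration of that general idea, not the authors' method. The function name `intervals_to_frame_scores`, the `(start_sec, end_sec, confidence)` tuple layout, and the choice to combine overlapping intervals with a maximum are all assumptions made for this example.

```python
from typing import List, Tuple


def intervals_to_frame_scores(
    intervals: List[Tuple[float, float, float]],
    num_frames: int,
    fps: float,
) -> List[float]:
    """Spread interval-level anomaly confidences onto individual frames.

    Hypothetical helper, not taken from the ESOM paper.
    Each interval is (start_sec, end_sec, confidence), with the end treated
    as exclusive. Frames covered by several intervals keep the highest
    confidence; frames covered by none get a score of 0.0.
    """
    scores = [0.0] * num_frames
    for start_sec, end_sec, confidence in intervals:
        # Convert seconds to frame indices and clamp to the video length.
        first = max(0, int(start_sec * fps))
        last = min(num_frames, int(end_sec * fps))
        for i in range(first, last):
            scores[i] = max(scores[i], confidence)
    return scores


# Two overlapping intervals on an 8-frame clip sampled at 1 frame per second.
frame_scores = intervals_to_frame_scores(
    [(2.0, 5.0, 0.8), (4.0, 6.0, 0.3)], num_frames=8, fps=1.0
)
```

Taking the maximum keeps a strong detection from being diluted by a weaker overlapping one. Averaging or a probabilistic combination would be equally plausible choices under different assumptions.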

Related papers