Fugu-MT Paper Translation (Abstract): ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

Paper overview: ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

arxiv url: http://arxiv.org/abs/2603.22872v1
Date: Tue, 24 Mar 2026 07:15:28 GMT
Status: Translation complete
System update date: 2026-03-25 19:53:37.351939
Title: ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance
Title (reference translation): ForeSea: AI Forensic Search with Multimodal Queries for Video Surveillance
Authors: Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli
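Abstract summary: ForeSea is an AI forensic search system with a 3-stage, plug-and-play pipeline. ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models.
Reference score (independently computed attention score): 56.15563109738998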
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods -- tracking pipelines, CLIP-based models, and VideoRAG -- require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. Moreover, there are no proper benchmarks for evaluating this setting, in which a video is queried with multimodal inputs. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Beyond this benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline: (1) a tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
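Abstract (reference translation): Despite decades of effort, surveillance still struggles to find specific targets in long, multi-camera video. Previous methods -- tracking pipelines, CLIP-based models, and VideoRAG -- required heavy manual filtering, captured only shallow attributes, and failed at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?"), yet this setting remains unexplored. In addition, there is no proper benchmark for evaluating this setting, in which a video is queried with multimodal queries. To address this gap, ForeSeaQA is a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning under realistic forensic conditions. Beyond this benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline: (1) a tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, accuracy improves by 3.5% and temporal IoU by 11.0 over previous VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding.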
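
Note: the abstract reports a temporal IoU score and describes retrieving top-K candidate clips, but gives no formulas. The following is a minimal sketch of both ideas under stated assumptions, not the authors' implementation. It assumes temporal IoU is the usual intersection-over-union of two time intervals, and that clips are ranked by cosine similarity between a query embedding and per-clip embeddings (the abstract does not name the similarity measure). The names `temporal_iou`, `cosine`, `top_k_clips`, and `clip_index` are hypothetical.

```python
import math


def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals, each (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def top_k_clips(query_vec, clip_index, k):
    """Return ids of the k clips whose embeddings are most similar to the query.

    clip_index is a hypothetical dict mapping clip id -> embedding vector.
    Cosine ranking is an assumption; the abstract does not specify it.
    """
    ranked = sorted(
        clip_index,
        key=lambda cid: cosine(query_vec, clip_index[cid]),
        reverse=True,
    )
    return ranked[:k]
```

For example, a predicted interval of (10, 20) seconds against a ground-truth interval of (15, 25) seconds overlaps for 5 seconds out of a 15-second union, giving a temporal IoU of 1/3. In a pipeline like the one described, the ids returned by `top_k_clips` would be the candidate clips handed to the VideoLLM.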
