Fugu-MT paper translation (abstract): S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Paper overview: S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

arxiv url: http://arxiv.org/abs/2605.28831v2
Date: Mon, 08 Jun 2026 12:03:48 GMT
Status: translation complete
Last updated in system: 2026-06-15 07:09:36.546254
Title: S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Title (reference translation): S3Mem: Spatiotemporal Event Memory for Long-Horizon Interactive Question Answering
Authors: Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li
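Abstract summary: Answering long-horizon memory questions often requires sparse evidence from heterogeneous histories. S3Mem (Structured Spatiotemporal Scene-Event Memory) is a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off.
Reference score (independently computed attention measure): 58.90783999951707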
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene--Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene--event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor--support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score--token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches \(0.48\) F1 and \(0.40\) BLEU with \(1{,}073\) evidence tokens per question, about \(15.8\times\) fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only \(189\) tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.
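Abstract (reference translation): Answering long-horizon memory questions often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand the reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. This work proposes S3Mem (Structured Spatiotemporal Scene-Event Memory), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. The router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off. On LoCoMo, S3Mem reaches \(0.48\) F1 and \(0.40\) BLEU with \(1{,}073\) evidence tokens. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with \(189\) tokens; on AMA-Bench, it is not the highest-scoring method but remains competitive while using the fewest reader-visible evidence tokens.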
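
Illustrative sketch (not from the paper): the abstract describes a router that scores candidate units, picks query anchors, follows anchor-support links, and returns a compact evidence pack. The abstract does not specify the scoring function or data layout, so the Python below is only a toy under stated assumptions: `Unit`, `overlap_score`, and `route` are hypothetical names, relevance is a plain word-overlap score standing in for whatever scorer S3Mem actually uses, and token counts are whitespace word counts.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical scene-event unit: an id, its text, and ids of linked units."""
    uid: str
    text: str
    links: list[str] = field(default_factory=list)

    @property
    def tokens(self) -> int:
        # Whitespace word count; a stand-in for a real tokenizer.
        return len(self.text.split())


def overlap_score(query: str, unit: Unit) -> float:
    """Fraction of query words found in the unit (assumed toy relevance score)."""
    q = set(query.lower().split())
    u = set(unit.text.lower().split())
    return len(q & u) / len(q) if q else 0.0


def route(query: str, units: list[Unit], budget: int, hops: int = 1) -> list[Unit]:
    """Pick anchors by score, follow links for up to `hops` steps, and pack
    units greedily while the total stays within the token budget."""
    by_id = {u.uid: u for u in units}
    ranked = sorted(units, key=lambda u: overlap_score(query, u), reverse=True)
    pack, used, seen = [], 0, set()

    def try_add(unit: Unit) -> bool:
        nonlocal used
        if unit.uid in seen or used + unit.tokens > budget:
            return False
        seen.add(unit.uid)
        pack.append(unit)
        used += unit.tokens
        return True

    for anchor in ranked:
        if overlap_score(query, anchor) == 0.0:
            break  # remaining units share no words with the query
        if not try_add(anchor):
            continue
        frontier = [anchor]
        for _ in range(hops):
            next_frontier = []
            for node in frontier:
                for uid in node.links:
                    linked = by_id.get(uid)
                    if linked is not None and try_add(linked):
                        next_frontier.append(linked)
            frontier = next_frontier
    return pack


units = [
    Unit("e1", "Alice put the key in the drawer", links=["e2"]),
    Unit("e2", "the drawer was locked at noon"),
    Unit("e3", "Bob watered the plants"),
]
pack = route("where is the key", units, budget=13)
print([u.uid for u in pack], sum(u.tokens for u in pack))
```

In this toy setup, `hops=1` corresponds to a single anchor-to-support step, and larger values allow short multi-hop chains. The budget check is what keeps the reader-visible evidence compact; a real system would replace both the scorer and the token counter.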

Related papers