Fugu-MT Paper Translation (Abstract): HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Paper summary: HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

arxiv url: http://arxiv.org/abs/2605.08158v1
Date: Mon, 04 May 2026 09:35:25 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.407274
Title: HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding
Title (reference translation): HY-Himmel Technical Report: Long Video Understanding with Hierarchical Interleaved Multi-stream Motion Encoding
Authors: Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye, Zhenyu Guan, Shiquan Dong, Tiankun Yang, Tao Yu,
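Abstract summary: HY-Himmel is a hierarchical video-language framework that allocates semantic and motion capacity separately. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2% to 63.5%) while using 3.6x fewer context tokens.
Reference score (attention score computed independently by the site): 13.606091816002879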
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
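Abstract (reference translation): Long-video understanding with multimodal language models faces three compounding bottlenecks: the decode cost of obtaining dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. This paper proposes HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM through a differentiable placeholder mechanism, after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2% to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.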
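
Illustrative note: the abstract mentions a Stage-1 contrastive alignment between motion tokens and the frozen visual backbone but does not state the loss. The sketch below assumes a symmetric InfoNCE objective, one common contrastive choice; it is not taken from the paper. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum - row[i]
    return total / len(logits)

def info_nce(motion, visual, temperature=0.07):
    """Symmetric InfoNCE over paired (motion, visual) embeddings.

    ASSUMPTION: this loss is an illustrative stand-in; the abstract only
    says 'contrastive alignment' without naming the objective.
    """
    m = [normalize(row) for row in motion]
    v = [normalize(row) for row in visual]
    # Cosine similarity of every motion embedding with every visual one.
    logits = [[sum(a * b for a, b in zip(mi, vj)) / temperature for vj in v]
              for mi in m]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy data: 'aligned' motion embeddings sit close to their visual partners,
# 'unaligned' ones are independent random vectors.
random.seed(0)
dim, batch = 64, 8
visual = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in visual]
unaligned = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

loss_aligned = info_nce(aligned, visual)
loss_unaligned = info_nce(unaligned, visual)
print(loss_aligned, loss_unaligned)
```

The loss is lower when each motion embedding lies near its paired visual embedding, which is the sense in which such an objective would pull the motion representation toward the geometry of a frozen backbone.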
