Fugu-MT Paper Translation (Abstract): ATM: Action Temporality Modeling for Video Question Answering

Paper abstract: ATM: Action Temporality Modeling for Video Question Answering

arxiv url: http://arxiv.org/abs/2309.02290v1
Date: Tue, 5 Sep 2023 14:52:38 GMT
Status: Translation complete
Last updated in system: 2023-09-06 14:14:12.839220
Title: ATM: Action Temporality Modeling for Video Question Answering
Title (reference translation): ATM: Action Temporality Modeling for Video Question Answering
Authors: Junwen Chen, Jie Zhu, Yu Kong
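Abstract summary: This paper proposes Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness. ATM outperforms previous approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
Reference score (attention score computed by this site): 27.239039564918134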
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant progress in video question answering (VideoQA), existing methods fall short of questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking the optical flow and realizing that optical flow is effective in capturing the long horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms previous approaches in terms of the accuracy on multiple VideoQAs and exhibits better true temporality reasoning ability.
Abstract (reference translation): Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This is due to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking the optical flow and realizing that optical flow is effective in capturing the long horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms previous approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
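Item (3) of the abstract, preventing the model from answering when the video is shuffled, can be illustrated with a minimal sketch. The abstract does not state the loss that ATM uses, so everything below is an assumption made for illustration: the function names (`shuffle_frames`, `shuffled_confidence_penalty`, `total_loss`), the `weight` parameter, and the choice of a KL penalty that pushes the answer distribution on shuffled input toward uniform. It is not the authors' implementation.

```python
import math
import random


def shuffle_frames(frames, seed=0):
    """Return a copy of `frames` in a random order that differs from the input order."""
    n = len(frames)
    if n < 2:
        return list(frames)
    order = list(range(n))
    rng = random.Random(seed)
    # Reshuffle until the permutation is not the identity.
    while order == list(range(n)):
        rng.shuffle(order)
    return [frames[i] for i in order]


def softmax(logits):
    """Numerically stable softmax over a list of answer logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, answer_idx):
    """Standard QA loss on the correctly ordered video."""
    return -math.log(max(softmax(logits)[answer_idx], 1e-12))


def shuffled_confidence_penalty(logits):
    """KL(uniform || p): zero only when the model is undecided between answers.

    Assumed stand-in for "not answering" on shuffled video; hypothetical.
    """
    probs = softmax(logits)
    k = len(probs)
    return sum((1.0 / k) * math.log((1.0 / k) / max(p, 1e-12)) for p in probs)


def total_loss(logits_ordered, logits_shuffled, answer_idx, weight=1.0):
    """Answer correctly on the ordered video, stay unconfident on the shuffled one."""
    return (cross_entropy(logits_ordered, answer_idx)
            + weight * shuffled_confidence_penalty(logits_shuffled))
```

In use, a model would be run twice per training example, once on the original frames and once on `shuffle_frames(frames)`, and the two sets of answer logits passed to `total_loss`. A model that answers confidently from appearance alone, regardless of frame order, is penalized by the second term.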

Related papers

LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
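This paper introduces LeAdQA, a novel method that bridges these gaps. Experiments on NExT-QA, IntentQA, and NExT-GQA show that the method's precise visual grounding significantly improves understanding of video-question relationships.
Paper reference translation (metadata) (2025-07-20T01:57:00Z)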
The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels [22.497467057872377]
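This work is the first systematic study of the distortions associated with System I and System II reasoning in multimodal contexts. It demonstrates that slow-reasoning models, when presented with incomplete or misleading visual inputs, tend to fabricate plausible but false details to support flawed reasoning.
Paper reference translation (metadata) (2025-05-26T16:55:38Z)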
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
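This paper proposes Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout reasoning. The proposed approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
Paper reference translation (metadata) (2025-03-17T16:45:12Z)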
Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation [47.61288672890036]
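This work investigates how self-attention query features in text-to-video models control motion, structure, and identity. The analysis reveals that Q affects not only layout but, during denoising, also has a strong effect on subject identity. Two applications are demonstrated: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation.
Paper reference translation (metadata) (2024-12-10T18:49:39Z)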
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
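Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) in response to an input question. By enabling implicit reasoning via questions, the task extends existing grounding work on explicit action/motion grounding to a more general format. We introduce a new baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
Paper reference translation (metadata) (2024-11-15T03:45:09Z)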
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
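TemporalBench is a new benchmark for evaluating fine-grained temporal understanding in videos. It consists of 10K video question-answer pairs derived from 2K high-quality human annotations detailing the temporal dynamics of video clips. State-of-the-art models such as GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
Paper reference translation (metadata) (2024-10-14T17:59:58Z)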
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
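This paper proposes TimeBalance, a student-teacher semi-supervised learning framework. It distills knowledge from a temporally-invariant teacher and a temporally-distinctive teacher. The proposed method achieves state-of-the-art performance on three action recognition benchmarks.
Paper reference translation (metadata) (2023-03-28T19:28:54Z)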
Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
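We design simple yet effective self-supervised tasks for video models to better learn temporal dynamics. The model learns the temporal order of video frames as extra self-supervision and is forced to give low-confidence outputs for randomly shuffled frames. We demonstrate the effectiveness of the method and its compatibility with state-of-the-art video transformers on various video action recognition tasks.
Paper reference translation (metadata) (2022-07-19T04:44:08Z)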
Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
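This work proposes a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding. Separate motion and appearance branches are developed to learn motion-guided and appearance-guided object relations. The proposed MARN outperforms previous state-of-the-art methods by a large margin.
Paper reference translation (metadata) (2022-01-03T02:44:18Z)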
Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
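Self-regulated learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current timestamp. SRL significantly outperforms existing state-of-the-art methods on two egocentric video datasets and two third-person video datasets.
Paper reference translation (metadata) (2021-11-23T03:29:18Z)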

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.