Fugu-MT Paper Translation (Summary): Boosting Temporal Sentence Grounding via Causal Inference

Paper summary: Boosting Temporal Sentence Grounding via Causal Inference

arxiv url: http://arxiv.org/abs/2507.04958v1
Date: Mon, 07 Jul 2025 13:01:06 GMT
Status: Translation complete
Last updated in system: 2025-07-08 15:46:35.425313
Title: Boosting Temporal Sentence Grounding via Causal Inference
Title (reference translation): Promoting Temporal Sentence Grounding through Causal Inference
Authors: Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
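Abstract summary: Temporal Sentence Grounding (TSG) aims to identify the moments in a video that semantically correspond to a given text query. Spurious correlations between videos and text queries arise from two factors: (1) biases inherent in the text data, such as frequent co-occurrences of specific verbs or phrases, and (2) a tendency to overfit to salient or repetitive patterns in video content. The paper proposes a new TSG framework, causal intervention and counterfactual reasoning, that uses causal inference.
Reference score (independently computed attention score): 48.04297516212874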
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
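Abstract (reference translation): Temporal Sentence Grounding (TSG) aims to identify the moments in an untrimmed video that semantically correspond to a given text query. Although existing studies have made substantial progress, they often overlook the problem of spurious correlations between videos and text queries. These spurious correlations arise from two factors: (1) biases inherent in the text data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases lead the model to associate textual cues with the wrong visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, the paper proposes a new TSG framework, causal intervention and counterfactual reasoning, which uses causal inference to eliminate spurious correlations and improve the model's robustness. Specifically, the TSG task is first formulated from a causal perspective using a structural causal model. Then, to address unobserved confounders that reflect textual biases toward specific verbs or phrases, a textual causal intervention is proposed that uses do-calculus to estimate causal effects. In addition, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses only on video features, excluding the query and the fused multi-modal features. This makes it possible to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.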
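
Illustrative sketch (not the authors' implementation): the abstract names two mechanisms, a textual causal intervention based on do-calculus and a visual counterfactual step that removes the video-only effect from the overall effect. The toy Python below shows one common way each idea can be written down. It assumes a small, enumerable confounder set so that backdoor adjustment applies, and it assumes the video-only effect is removed by a weighted subtraction of scores. All function names, the `alpha` weight, and the numbers are hypothetical and are not taken from the paper or the CICR repository.

```python
# Toy illustration only. Names, numbers, and `alpha` are assumptions made for
# this sketch; they do not come from the paper or its code release.

def backdoor_adjust(p_y_given_q_z, p_z):
    """Interventional estimate: P(y | do(q)) = sum_z P(y | q, z) * P(z)."""
    return [sum(p * w for p, w in zip(row, p_z)) for row in p_y_given_q_z]


def observational(p_y_given_q_z, p_z_given_q):
    """Observational estimate: P(y | q) = sum_z P(y | q, z) * P(z | q)."""
    return [
        sum(p * w for p, w in zip(row, weights))
        for row, weights in zip(p_y_given_q_z, p_z_given_q)
    ]


def counterfactual_debias(fused_scores, video_only_scores, alpha=1.0):
    """Subtract the video-only branch scores from the fused scores.

    `alpha` is a hypothetical weight on the removed video-only effect.
    """
    return [f - alpha * v for f, v in zip(fused_scores, video_only_scores)]


# Rows index two queries (q0, q1); columns index two confounder values (z0, z1).
P_Y_GIVEN_Q_Z = [[0.9, 0.2], [0.6, 0.1]]
P_Z = [0.5, 0.5]                          # marginal over the confounder
P_Z_GIVEN_Q = [[0.9, 0.1], [0.2, 0.8]]    # confounder correlates with the query

# Scores for three candidate moments from a fused branch and a video-only branch.
FUSED = [0.2, 1.5, 0.9]
VIDEO_ONLY = [0.1, 1.2, 0.1]

if __name__ == "__main__":
    print("P(y | q)     :", observational(P_Y_GIVEN_Q_Z, P_Z_GIVEN_Q))
    print("P(y | do(q)) :", backdoor_adjust(P_Y_GIVEN_Q_Z, P_Z))
    print("debiased     :", counterfactual_debias(FUSED, VIDEO_ONLY))
```

In this toy setting the observational estimates for the two queries are far apart because the confounder is correlated with the query, while the adjusted estimates are closer together. In the second part, the candidate at index 1 scores highest on the fused branch mainly because the video-only branch already favors it; after the subtraction, the candidate at index 2 ranks first.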

Related papers