Fugu-MT Paper Translation (Summary): TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

Paper summary: TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

arxiv url: http://arxiv.org/abs/2604.00696v1
Date: Wed, 01 Apr 2026 09:52:57 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.930217
Title: TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
Title (reference translation): TTA-Vid: General Test-Time Adaptation for Video Reasoning
Authors: Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne
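Abstract summary: By applying the Test-Time Reinforcement Learning paradigm to video-language data, a pretrained model can be adapted to video samples at test time without explicit labels. The test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously. TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms state-of-the-art methods trained on large-scale data.
Reference score (independently computed attention score): 54.70019148172847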
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously. (1) A test-time adaptation step performs step-by-step reasoning at inference time on multiple frame subsets and uses a batch-aware, frequency-based reward computed across those frame subsets as pseudo ground truth to update the model. We show that the resulting model, trained on a single batch or even a single sample from a dataset, generalizes at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. (2) A multi-armed bandit strategy for adaptive frame selection learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
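The abstract describes the reward only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' formulation. It assumes that each frame subset yields one final answer string and that a rollout's reward is the fraction of rollouts that agree with it. The names `frequency_rewards` and `pseudo_label` are hypothetical, and the batch-aware part of the reward is not modeled here.

```python
from collections import Counter


def frequency_rewards(answers):
    """Reward each answer by how often it occurs among all rollouts.

    Assumption: `answers` holds one final answer per frame subset for the
    same question. This is an illustrative stand-in for the paper's reward.
    """
    if not answers:
        return []
    counts = Counter(answers)
    total = len(answers)
    return [counts[answer] / total for answer in answers]


def pseudo_label(answers):
    """Return the most frequent answer, used here as pseudo ground truth."""
    return Counter(answers).most_common(1)[0][0]


# Five rollouts of the same question on five different frame subsets.
answers = ["B", "B", "A", "B", "C"]
rewards = frequency_rewards(answers)
label = pseudo_label(answers)
```

In this sketch the majority answer receives the highest reward, so no ground-truth label is needed to produce a training signal.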
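Abstract (reference translation): Recent video reasoning models show strong results on temporal and multimodal understanding, but they rely on large-scale supervised data and multi-stage training pipelines, which makes them costly to train and hard to adapt to new domains. This work applies the Test-Time Reinforcement Learning paradigm to video-language data so that a pretrained model can be adapted to test-time video samples without explicit labels. The test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. A batch-aware, frequency-based reward computed across the different frame subsets is then used as pseudo ground truth to update the model. The resulting model, trained on a single batch or even a single sample from a dataset, generalizes at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, the method requires no ground-truth annotations or dedicated training splits. The authors also propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. The evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.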
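The abstract does not say which bandit algorithm drives the adaptive frame selection. As an illustration only, the sketch below uses the standard UCB1 rule, with one arm per candidate frame and a scalar reward credited to every frame in the chosen subset. The class `FrameBandit`, its parameters, and the reward values are hypothetical and are not taken from the paper.

```python
import math


class FrameBandit:
    """UCB1 bandit over candidate frames (illustrative stand-in)."""

    def __init__(self, num_arms, exploration=1.0):
        self.counts = [0] * num_arms    # times each frame was selected
        self.values = [0.0] * num_arms  # running mean reward per frame
        self.exploration = exploration

    def select(self, k):
        """Pick the k frames with the highest upper confidence bound."""
        total = sum(self.counts)

        def score(arm):
            if self.counts[arm] == 0:
                return float("inf")  # try every frame at least once
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return self.values[arm] + self.exploration * bonus

        ranked = sorted(range(len(self.counts)), key=score, reverse=True)
        return sorted(ranked[:k])  # return frames in temporal order

    def update(self, arms, reward):
        """Credit the same reward to every frame in the chosen subset."""
        for arm in arms:
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


bandit = FrameBandit(num_arms=4)
first = bandit.select(2)
bandit.update(first, reward=1.0)   # e.g. a high agreement reward
second = bandit.select(2)
bandit.update(second, reward=0.0)  # e.g. a low agreement reward
third = bandit.select(2)
```

After both subsets have been tried once, the frames that earned the higher reward are preferred, which is the behavior the abstract attributes to the selection strategy.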

Related papers