Fugu-MT Paper Translation (Abstract): LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

Paper abstract: LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

arxiv url: http://arxiv.org/abs/2605.06809v1
Date: Thu, 07 May 2026 18:08:31 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:38.532021
Title: LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
Title (reference translation): Fast video recognition that learns when, where, and what to compute
Authors: Ali Salamatian, Anthony Fuller, Pritam Sarkar, James R. Green, Leonid Sigal, Evan Shelhamer
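Abstract summary: LookWhen is a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Its shallow selector takes a scaled-down video and quickly scores all tokens across space-time, while its deep extractor takes the top-K selected tokens to approximate full-video representations. LookWhen is 6.7x faster than InternVideo2-B at equal accuracy.
Reference score (independently computed attention score): 32.29218279577984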
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.
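Abstract (reference translation): Transformers dominate video recognition. They split videos into tokens, and processing those tokens incurs an expensive superlinear computational cost. Yet videos are full of redundancy, so the need for this expense can be questioned. LookWhen is a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Its shallow selector takes a scaled-down video and quickly scores all tokens across space-time, while its deep extractor takes the top-K selected tokens and approximates full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, normalizing the image teacher's frame-wise representations to learn what changes within videos. Through these strategies, the selector-extractor learns general and efficient representations for feature extraction or for fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, which measures time in practice, LookWhen is more efficient still, running 6.7x faster than InternVideo2-B at equal accuracy.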
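
The selection idea in the abstract (rank tokens by uniqueness with a nearest-neighbor distance, then keep the top-K) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the distance metric, which representations are scored, or how the score supervises the selector. Euclidean distance and the function names `uniqueness_scores` and `select_top_k` are assumptions made for illustration only.

```python
import numpy as np


def uniqueness_scores(tokens: np.ndarray) -> np.ndarray:
    """Score each token by the distance to its nearest other token.

    tokens: array of shape (num_tokens, dim).
    Assumption: Euclidean distance; the abstract only says
    "a simple nearest-neighbor distance".
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    # Pairwise differences, shape (num_tokens, num_tokens, dim).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # A token must not count as its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    # Redundant tokens have a close neighbor, so their score is low.
    return dist.min(axis=1)


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring tokens, best first."""
    order = np.argsort(-scores, kind="stable")
    return order[:k]


# Hypothetical toy input: three near-duplicate tokens and one outlier.
tokens = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [0.1, 0.0]])
scores = uniqueness_scores(tokens)
kept = select_top_k(scores, k=2)
```

In this toy input the outlier token receives the largest score, so it is selected first, while the near-duplicates score low and are candidates for being skipped. The pairwise distance matrix here is quadratic in the number of tokens, which is acceptable for a sketch but says nothing about how the paper computes the score at scale.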
