Fugu-MT Paper Translation (Abstract): The TIME Machine: On The Power of Motion for Efficient Perception

Paper overview: The TIME Machine: On The Power of Motion for Efficient Perception

arxiv url: http://arxiv.org/abs/2605.23045v1
Date: Thu, 21 May 2026 21:22:42 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.104572
Title: The TIME Machine: On The Power of Motion for Efficient Perception
Title (reference translation): The TIME Machine: On the Power of Motion for Efficient Perception
Authors: Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara
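Abstract summary: This paper proposes a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point tracks, a masked autoencoder masks some of the tracks and is trained to reconstruct the missing ones. The authors show that using motion to represent videos addresses both of the core limitations of video technology.
Reference score (independently computed attention score): 10.074545631396383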
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware as well as more scalable.
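Abstract (reference translation): Video representation learning has made dramatic progress in recent years. This progress has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. These factors also bring limitations of their own: first, scaling video models can reach prohibitive costs, and second, learning from language restricts the range of learnable concepts to those that appear in captions. As a result, video models still struggle with temporal understanding. This paper proposes a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point tracks, a masked autoencoder masks some of the tracks and is trained to reconstruct the missing ones. This makes it possible to learn a representation in a self-supervised manner. The authors show that using motion to represent videos addresses both of the core limitations of video technology. First, because motion is inherently independent of appearance, it needs fewer examples to generalize well, which greatly reduces the scale of training data. Second, motion bypasses the language-dependent training paradigm and learns better fine-grained concepts. The result is an embedding called TIME (Temporally Informed Motion Embedding), trained exclusively on synthetic motion data. The embedding is tested zero-shot on a wide range of tasks. Without bells and whistles, its performance is on par with state-of-the-art models while using up to 4 orders of magnitude less training data. This is a step toward a new paradigm of video models that are both more temporally aware and more scalable.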
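
The masking-and-reconstruction objective described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give these details: the array layout `(num_tracks, num_frames, 2)`, the 0.75 mask ratio, the mean-squared-error loss, and the helper names `mask_tracks` and `masked_reconstruction_loss`. The "model" is a trivial placeholder that predicts each hidden track as the mean of the visible ones; it stands in for the autoencoder and is not the authors' architecture.

```python
import numpy as np


def mask_tracks(tracks, mask_ratio, rng):
    """Randomly split track indices into visible and masked sets.

    tracks: array of shape (num_tracks, num_frames, 2) holding (x, y)
    positions per frame. This layout is an assumption for illustration.
    """
    num_tracks = tracks.shape[0]
    num_masked = int(round(mask_ratio * num_tracks))
    perm = rng.permutation(num_tracks)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx


def masked_reconstruction_loss(pred, target, masked_idx):
    """Mean squared error computed over the masked tracks only."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
tracks = rng.normal(size=(16, 8, 2))  # 16 synthetic tracks, 8 frames each

# Hypothetical mask ratio; the abstract does not state one.
visible_idx, masked_idx = mask_tracks(tracks, 0.75, rng)

# Placeholder "model": predict every masked track as the mean visible track.
pred = tracks.copy()
pred[masked_idx] = tracks[visible_idx].mean(axis=0)

loss = masked_reconstruction_loss(pred, tracks, masked_idx)
print(f"masked tracks: {len(masked_idx)}, reconstruction loss: {loss:.4f}")
```

In a real training loop the placeholder prediction would be replaced by an encoder-decoder that sees only the visible tracks, and the loss would be minimized by gradient descent. Restricting the loss to masked tracks is the usual masked-autoencoder convention and is assumed here rather than taken from the paper.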

Related papers