Test-Time Training on Video Streams
- URL: http://arxiv.org/abs/2307.05014v2
- Date: Wed, 12 Jul 2023 04:19:48 GMT
- Title: Test-Time Training on Video Streams
- Authors: Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros,
Xiaolong Wang
- Abstract summary: Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time.
We extend TTT to the streaming setting, where multiple test instances arrive in temporal order.
Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets.
- Score: 54.07009446207442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work has established test-time training (TTT) as a general framework to
further improve a trained model at test time. Before making a prediction on
each test instance, the model is trained on the same instance using a
self-supervised task, such as image reconstruction with masked autoencoders. We
extend TTT to the streaming setting, where multiple test instances - video
frames in our case - arrive in temporal order. Our extension is online TTT: The
current model is initialized from the previous model, then trained on the
current frame and a small window of frames immediately before. Online TTT
significantly outperforms the fixed-model baseline for four tasks, on three
real-world datasets. The relative improvement is 45% and 66% for instance and
panoptic segmentation. Surprisingly, online TTT also outperforms its offline
variant that accesses more information, training on all frames from the entire
test video regardless of temporal order. This differs from previous findings
using synthetic videos. We conceptualize locality as the advantage of online
over offline TTT. We analyze the role of locality with ablations and a theory
based on bias-variance trade-off.
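To make the procedure concrete, here is a minimal sketch of the online TTT loop in PyTorch. It assumes an encoder trained jointly with a masked-autoencoder decoder and a downstream task head; the masking scheme, hyperparameters, and module names (encoder, decoder, task_head) are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def mae_loss(encoder, decoder, batch, mask_ratio=0.75):
        # Toy stand-in for masked-autoencoder reconstruction: hide random
        # pixels and score the reconstruction only on the hidden ones.
        mask = (torch.rand_like(batch) < mask_ratio).float()
        recon = decoder(encoder(batch * (1.0 - mask)))
        return F.mse_loss(recon * mask, batch * mask)

    def online_ttt(encoder, decoder, task_head, frames, window=4, lr=1e-4):
        # frames: list of (C, H, W) tensors arriving in temporal order.
        params = list(encoder.parameters()) + list(decoder.parameters())
        opt = torch.optim.SGD(params, lr=lr)
        preds = []
        for t, frame in enumerate(frames):
            # The model at frame t is carried over from frame t-1, then
            # trained on the current frame plus a small preceding window.
            clip = torch.stack(frames[max(0, t - window): t + 1])
            encoder.train(); decoder.train()
            opt.zero_grad()
            mae_loss(encoder, decoder, clip).backward()
            opt.step()
            encoder.eval()
            with torch.no_grad():
                preds.append(task_head(encoder(frame.unsqueeze(0))))
        return preds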
Related papers
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States [69.78469963604063]
We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state.
Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training layers.
arXiv Detail & Related papers (2024-07-05T16:23:20Z)
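As a hedged illustration of the idea above, the sketch below treats the hidden state of a sequence layer as the weights of a linear model, updated by one gradient step of a simple reconstruction loss per token; the actual layers use learned projections and richer inner objectives.

    import torch

    def ttt_linear_layer(tokens, lr=0.1):
        # tokens: (T, d). The hidden state is the weight matrix W of a
        # linear model; processing a token means training W on it.
        d = tokens.shape[-1]
        W = torch.zeros(d, d)
        outputs = []
        for x in tokens:
            # Inner self-supervised loss ||W x - x||^2; its gradient in W
            # is 2 (W x - x) x^T, so the state update is a training step.
            grad = 2.0 * torch.outer(W @ x - x, x)
            W = W - lr * grad
            outputs.append(W @ x)  # emit the token under the updated state
        return torch.stack(outputs)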
- NC-TTT: A Noise Contrastive Approach for Test-Time Training [19.0284321951354]
Noise-Contrastive Test-Time Training (NC-TTT) is a novel unsupervised TTT technique based on the discrimination of noisy feature maps.
By learning to classify noisy views of projected feature maps, and then adapting the model accordingly on new domains, classification performance can be recovered by a significant margin.
arXiv Detail & Related papers (2024-04-12T10:54:11Z)
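A minimal sketch of a noise-contrastive auxiliary task in the spirit of NC-TTT: a small discriminator learns to tell projected feature maps from noisy versions of them. The projector, discriminator, and noise scale are placeholder assumptions.

    import torch
    import torch.nn.functional as F

    def noise_contrastive_loss(feats, projector, discriminator, sigma=0.5):
        # feats: (N, D) feature maps flattened per sample.
        clean = projector(feats)
        noisy = clean + sigma * torch.randn_like(clean)
        logits = discriminator(torch.cat([clean, noisy], dim=0)).squeeze(-1)
        labels = torch.cat([torch.ones(len(clean)),
                            torch.zeros(len(noisy))]).to(logits.device)
        # Binary discrimination of clean vs. noisy projected features; at
        # test time this loss can drive adaptation on the new domain.
        return F.binary_cross_entropy_with_logits(logits, labels)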
- Depth-aware Test-Time Training for Zero-shot Video Object Segmentation [48.2238806766877]
We introduce a test-time training (TTT) strategy to address the problem of generalization to unseen videos.
Our key insight is to constrain the model to predict consistent depth during the TTT process.
Our proposed video TTT strategy significantly outperforms state-of-the-art TTT methods.
arXiv Detail & Related papers (2024-03-07T06:40:53Z)
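One hedged reading of the depth-consistency idea above: penalize disagreement between depth maps predicted on a frame and on a transformed view of it. The horizontal-flip augmentation and the model interface are simplifying assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def depth_consistency_loss(depth_net, frame):
        # frame: (1, C, H, W). Flipping the input horizontally should
        # produce a horizontally flipped depth map.
        d_orig = depth_net(frame)
        d_flip = depth_net(torch.flip(frame, dims=[-1]))
        return F.l1_loss(torch.flip(d_flip, dims=[-1]), d_orig)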
- ClusT3: Information Invariant Test-Time Training [19.461441044484427]
Test-time training (TTT) methods have been developed in an attempt to mitigate vulnerabilities to domain shifts.
We propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation.
Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks.
arXiv Detail & Related papers (2023-10-18T21:43:37Z)
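A hedged sketch of the mutual-information objective named above: with soft cluster assignments as the discrete latent, I(features; latent) decomposes into the entropy of the mean assignment minus the mean per-sample entropy. The cluster_head projection is an assumption.

    import torch

    def mutual_information_loss(feats, cluster_head, eps=1e-8):
        # feats: (N, D) features; cluster_head maps them to K logits.
        p = torch.softmax(cluster_head(feats), dim=-1)  # (N, K)
        marginal = p.mean(dim=0)
        h_marginal = -(marginal * (marginal + eps).log()).sum()
        h_cond = -(p * (p + eps).log()).sum(dim=-1).mean()
        # Maximize MI = H(marginal) - H(conditional); return the negative
        # so it can be minimized during test-time training.
        return -(h_marginal - h_cond)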
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently across transformed views of the video.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
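A hedged sketch of transform-equivariant consistency for boundary prediction: if the video is shifted in time, the predicted (start, end) of the queried activity should shift by the same amount. The circular shift and the model interface are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def equivariant_boundary_loss(model, video, query, shift=8):
        # video: (B, T, C, H, W); model returns per-sample (start, end)
        # frame indices as float tensors.
        start, end = model(video, query)
        shifted = torch.roll(video, shifts=shift, dims=1)
        s2, e2 = model(shifted, query)
        # Equivariance: boundaries on the shifted video should equal the
        # original boundaries shifted by the same offset.
        return F.l1_loss(s2, start + shift) + F.l1_loss(e2, end + shift)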
- SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
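A minimal sketch of a prompt-tuning step as commonly described for TPT: minimize the entropy of the prediction averaged over augmented views of the single test image, updating only the prompt parameters. The clip_logits callable wrapping a CLIP-style model is an assumption.

    import torch

    def tpt_step(clip_logits, prompt, views, optimizer):
        # views: list of augmented tensors of one test image; prompt is the
        # only tensor with requires_grad=True, so only it gets updated.
        logits = torch.stack([clip_logits(v, prompt) for v in views])
        probs = logits.softmax(dim=-1).mean(dim=0)  # marginal over views
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
        return entropy.item()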
- Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering [37.76664203157892]
We develop a test-time anchored clustering (TTAC) approach to enable stronger test-time feature learning.
TTAC discovers clusters in both the source and target domains and matches the target clusters to the source ones to improve generalization.
We demonstrate that, under all TTT protocols, TTAC consistently outperforms state-of-the-art methods on five TTT datasets.
arXiv Detail & Related papers (2022-06-06T16:23:05Z)
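A hedged sketch of the anchored-clustering idea: estimate target cluster means from soft assignments and pull them toward the corresponding source cluster means ("anchors"). The full TTAC method also matches covariances and filters assignments, which this omits.

    import torch

    def anchored_cluster_loss(target_feats, soft_assign, source_means):
        # target_feats: (N, D); soft_assign: (N, K); source_means: (K, D).
        weights = soft_assign / soft_assign.sum(dim=0,
                                                keepdim=True).clamp_min(1e-8)
        target_means = weights.t() @ target_feats  # (K, D)
        # Match each target cluster mean to its source anchor.
        return ((target_means - source_means) ** 2).sum(dim=1).mean()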