Test-Time Training on Video Streams
- URL: http://arxiv.org/abs/2307.05014v3
- Date: Sat, 04 Jan 2025 03:59:48 GMT
- Title: Test-Time Training on Video Streams
- Authors: Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
- Abstract summary: Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time.
We extend TTT to the streaming setting, where multiple test instances arrive in temporal order.
Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets.
- Abstract: Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off.
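The online TTT procedure described in the abstract can be summarized as a short loop: before predicting on each frame, the model carried over from the previous frame is trained on the current frame and a small window of immediately preceding frames with a self-supervised reconstruction loss. Below is a minimal PyTorch-style sketch under assumptions not taken from the paper: `encoder`, `recon_head`, `task_head`, `window`, `steps`, and `lr` are hypothetical placeholders, and plain MSE reconstruction stands in for the paper's self-supervised reconstruction task.

```python
# Minimal sketch of online TTT on a video stream (hypothetical names and
# hyperparameters; MSE reconstruction is a stand-in self-supervised task).
from collections import deque

import torch
import torch.nn.functional as F


def online_ttt(encoder, recon_head, task_head, frames, window=3, steps=1, lr=1e-4):
    """Adapt on each incoming frame, then predict on it."""
    buffer = deque(maxlen=window + 1)  # current frame plus a small window before it
    predictions = []
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(recon_head.parameters()), lr=lr
    )
    for frame in frames:  # frames arrive in temporal order
        buffer.append(frame)
        # Online TTT: the current model is initialized from the previous one,
        # i.e. the same parameters keep being updated rather than reset.
        for _ in range(steps):
            batch = torch.stack(list(buffer))    # (<= window+1, C, H, W)
            recon = recon_head(encoder(batch))
            loss = F.mse_loss(recon, batch)      # self-supervised reconstruction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        with torch.no_grad():                    # prediction on the current frame
            predictions.append(task_head(encoder(frame.unsqueeze(0))))
    return predictions
```

The key difference from offline TTT is visible in the loop: each update sees only the current frame and a small window of immediately preceding frames, and the adapted parameters are carried forward through time.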
Related papers
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States [69.78469963604063]
We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state.
Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training layers.
arXiv Detail & Related papers (2024-07-05T16:23:20Z) - NC-TTT: A Noise Contrastive Approach for Test-Time Training [19.0284321951354]
Noise-Contrastive Test-Time Training (NC-TTT) is a novel unsupervised TTT technique based on the discrimination of noisy feature maps.
By learning to classify noisy views of projected feature maps, and then adapting the model accordingly on new domains, classification performance can be recovered by a significant margin.
arXiv Detail & Related papers (2024-04-12T10:54:11Z) - Depth-aware Test-Time Training for Zero-shot Video Object Segmentation [48.2238806766877]
We introduce a test-time training (TTT) strategy to address the problem of generalization to unseen videos.
Our key insight is to enforce the model to predict consistent depth during the TTT process.
Our proposed video TTT strategy significantly outperforms state-of-the-art TTT methods.
arXiv Detail & Related papers (2024-03-07T06:40:53Z) - ClusT3: Information Invariant Test-Time Training [19.461441044484427]
Test-time training (TTT) methods have been developed in an attempt to mitigate these vulnerabilities.
We propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation.
Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks.
arXiv Detail & Related papers (2023-10-18T21:43:37Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z) - Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z) - Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering [37.76664203157892]
We develop a test-time anchored clustering (TTAC) approach to enable stronger test-time feature learning.
TTAC discovers clusters in both the source and target domains and matches the target clusters to the source ones to improve generalization.
We demonstrate that, under all TTT protocols, TTAC consistently outperforms state-of-the-art methods on five TTT datasets.
arXiv Detail & Related papers (2022-06-06T16:23:05Z)