Test-Time Training on Video Streams
- URL: http://arxiv.org/abs/2307.05014v2
- Date: Wed, 12 Jul 2023 04:19:48 GMT
- Title: Test-Time Training on Video Streams
- Authors: Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros,
Xiaolong Wang
- Abstract summary: Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time.
We extend TTT to the streaming setting, where multiple test instances arrive in temporal order.
Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets.
- Score: 54.07009446207442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work has established test-time training (TTT) as a general framework to
further improve a trained model at test time. Before making a prediction on
each test instance, the model is trained on the same instance using a
self-supervised task, such as image reconstruction with masked autoencoders. We
extend TTT to the streaming setting, where multiple test instances - video
frames in our case - arrive in temporal order. Our extension is online TTT: The
current model is initialized from the previous model, then trained on the
current frame and a small window of frames immediately before. Online TTT
significantly outperforms the fixed-model baseline for four tasks, on three
real-world datasets. The relative improvement is 45% and 66% for instance and
panoptic segmentation. Surprisingly, online TTT also outperforms its offline
variant that accesses more information, training on all frames from the entire
test video regardless of temporal order. This differs from previous findings
using synthetic videos. We conceptualize locality as the advantage of online
over offline TTT. We analyze the role of locality with ablations and a theory
based on bias-variance trade-off.
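To make the procedure concrete, here is a minimal sketch of the online TTT loop in PyTorch. It assumes an encoder trained jointly with a masked-autoencoder decoder and a downstream task head; the masking scheme, hyperparameters, and module names (encoder, decoder, task_head) are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def mae_loss(encoder, decoder, batch, mask_ratio=0.75):
        # Toy stand-in for masked-autoencoder reconstruction: hide random
        # pixels and score the reconstruction only on the hidden ones.
        mask = (torch.rand_like(batch) < mask_ratio).float()
        recon = decoder(encoder(batch * (1.0 - mask)))
        return F.mse_loss(recon * mask, batch * mask)

    def online_ttt(encoder, decoder, task_head, frames, window=4, lr=1e-4):
        # frames: list of (C, H, W) tensors arriving in temporal order.
        params = list(encoder.parameters()) + list(decoder.parameters())
        opt = torch.optim.SGD(params, lr=lr)
        preds = []
        for t, frame in enumerate(frames):
            # The model at frame t is carried over from frame t-1, then
            # trained on the current frame plus a small preceding window.
            clip = torch.stack(frames[max(0, t - window): t + 1])
            encoder.train(); decoder.train()
            opt.zero_grad()
            mae_loss(encoder, decoder, clip).backward()
            opt.step()
            encoder.eval()
            with torch.no_grad():
                preds.append(task_head(encoder(frame.unsqueeze(0))))
        return preds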
Related papers
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States [69.78469963604063]
We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state.
Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training layers.
arXiv Detail & Related papers (2024-07-05T16:23:20Z)
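As a hedged illustration of the idea above, the sketch below treats the hidden state of a sequence layer as the weights of a linear model, updated by one gradient step of a simple reconstruction loss per token; the actual layers use learned projections and richer inner objectives.

    import torch

    def ttt_linear_layer(tokens, lr=0.1):
        # tokens: (T, d). The hidden state is the weight matrix W of a
        # linear model; processing a token means training W on it.
        d = tokens.shape[-1]
        W = torch.zeros(d, d)
        outputs = []
        for x in tokens:
            # Inner self-supervised loss ||W x - x||^2; its gradient in W
            # is 2 (W x - x) x^T, so the state update is a training step.
            grad = 2.0 * torch.outer(W @ x - x, x)
            W = W - lr * grad
            outputs.append(W @ x)  # emit the token under the updated state
        return torch.stack(outputs)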
- NC-TTT: A Noise Contrastive Approach for Test-Time Training [19.0284321951354]
Noise-Contrastive Test-Time Training (NC-TTT) is a novel unsupervised TTT technique based on the discrimination of noisy feature maps.
By learning to classify noisy views of projected feature maps, and then adapting the model accordingly on new domains, classification performance can be recovered by a significant margin.
arXiv Detail & Related papers (2024-04-12T10:54:11Z)
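A minimal sketch of a noise-contrastive auxiliary task in the spirit of NC-TTT: a small discriminator learns to tell projected feature maps from noisy versions of them. The projector, discriminator, and noise scale are placeholder assumptions.

    import torch
    import torch.nn.functional as F

    def noise_contrastive_loss(feats, projector, discriminator, sigma=0.5):
        # feats: (N, D) feature maps flattened per sample.
        clean = projector(feats)
        noisy = clean + sigma * torch.randn_like(clean)
        logits = discriminator(torch.cat([clean, noisy], dim=0)).squeeze(-1)
        labels = torch.cat([torch.ones(len(clean)),
                            torch.zeros(len(noisy))]).to(logits.device)
        # Binary discrimination of clean vs. noisy projected features; at
        # test time this loss can drive adaptation on the new domain.
        return F.binary_cross_entropy_with_logits(logits, labels)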
- Depth-aware Test-Time Training for Zero-shot Video Object Segmentation [48.2238806766877]
We introduce a test-time training (TTT) strategy to address the problem of generalization to unseen videos.
Our key insight is to constrain the model to predict consistent depth during the TTT process.
Our proposed video TTT strategy significantly outperforms state-of-the-art TTT methods.
arXiv Detail & Related papers (2024-03-07T06:40:53Z)
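One hedged reading of the depth-consistency idea above: penalize disagreement between depth maps predicted on a frame and on a transformed view of it. The horizontal-flip augmentation and the model interface are simplifying assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def depth_consistency_loss(depth_net, frame):
        # frame: (1, C, H, W). Flipping the input horizontally should
        # produce a horizontally flipped depth map.
        d_orig = depth_net(frame)
        d_flip = depth_net(torch.flip(frame, dims=[-1]))
        return F.l1_loss(torch.flip(d_flip, dims=[-1]), d_orig)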
- ClusT3: Information Invariant Test-Time Training [19.461441044484427]
Test-time training (TTT) methods have been developed in an attempt to mitigate vulnerabilities to domain shifts.
We propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation.
Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks.
arXiv Detail & Related papers (2023-10-18T21:43:37Z)
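A hedged sketch of the mutual-information objective named above: with soft cluster assignments as the discrete latent, I(features; latent) decomposes into the entropy of the mean assignment minus the mean per-sample entropy. The cluster_head projection is an assumption.

    import torch

    def mutual_information_loss(feats, cluster_head, eps=1e-8):
        # feats: (N, D) features; cluster_head maps them to K logits.
        p = torch.softmax(cluster_head(feats), dim=-1)  # (N, K)
        marginal = p.mean(dim=0)
        h_marginal = -(marginal * (marginal + eps).log()).sum()
        h_cond = -(p * (p + eps).log()).sum(dim=-1).mean()
        # Maximize MI = H(marginal) - H(conditional); return the negative
        # so it can be minimized during test-time training.
        return -(h_marginal - h_cond)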
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently across transformed views of the video.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
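A hedged sketch of transform-equivariant consistency for boundary prediction: if the video is shifted in time, the predicted (start, end) of the queried activity should shift by the same amount. The circular shift and the model interface are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def equivariant_boundary_loss(model, video, query, shift=8):
        # video: (B, T, C, H, W); model returns per-sample (start, end)
        # frame indices as float tensors.
        start, end = model(video, query)
        shifted = torch.roll(video, shifts=shift, dims=1)
        s2, e2 = model(shifted, query)
        # Equivariance: boundaries on the shifted video should equal the
        # original boundaries shifted by the same offset.
        return F.l1_loss(s2, start + shift) + F.l1_loss(e2, end + shift)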
- SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
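A minimal sketch of a prompt-tuning step as commonly described for TPT: minimize the entropy of the prediction averaged over augmented views of the single test image, updating only the prompt parameters. The clip_logits callable wrapping a CLIP-style model is an assumption.

    import torch

    def tpt_step(clip_logits, prompt, views, optimizer):
        # views: list of augmented tensors of one test image; prompt is the
        # only tensor with requires_grad=True, so only it gets updated.
        logits = torch.stack([clip_logits(v, prompt) for v in views])
        probs = logits.softmax(dim=-1).mean(dim=0)  # marginal over views
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
        return entropy.item()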
- Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering [37.76664203157892]
We develop a test-time anchored clustering (TTAC) approach to enable stronger test-time feature learning.
TTAC discovers clusters in both the source and target domains and matches the target clusters to the source ones to improve generalization.
We demonstrate that, under all TTT protocols, TTAC consistently outperforms state-of-the-art methods on five TTT datasets.
arXiv Detail & Related papers (2022-06-06T16:23:05Z)
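A hedged sketch of the anchored-clustering idea: estimate target cluster means from soft assignments and pull them toward the corresponding source cluster means ("anchors"). The full TTAC method also matches covariances and filters assignments, which this omits.

    import torch

    def anchored_cluster_loss(target_feats, soft_assign, source_means):
        # target_feats: (N, D); soft_assign: (N, K); source_means: (K, D).
        weights = soft_assign / soft_assign.sum(dim=0,
                                                keepdim=True).clamp_min(1e-8)
        target_means = weights.t() @ target_feats  # (K, D)
        # Match each target cluster mean to its source anchor.
        return ((target_means - source_means) ** 2).sum(dim=1).mean()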