TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale
- URL: http://arxiv.org/abs/2305.14173v1
- Date: Tue, 23 May 2023 15:44:56 GMT
- Title: TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale
- Authors: Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
- Abstract summary: We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
- Score: 59.01246141215051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ultimate goal for foundation models is to be task-agnostic, i.e.,
to support out-of-the-box usage without task-specific fine-tuning. Although
breakthroughs have been made in natural language processing and image
representation learning, it remains challenging for video models to reach this goal
due to the increasing uncertainty of spatiotemporal signals. To ease training,
existing works leverage image foundation models' prior knowledge and equip them
with efficient temporal modules. Despite the satisfactory fine-tuning
performance, we empirically find they fall short of out-of-the-box usage, given
the even degraded performance in zero-shot/linear protocols compared to their
baseline counterparts. In this work, we analyze the factor that leads to
degradation from the perspective of language supervision distortion. We argue
that tuning a text encoder end-to-end, as done in previous work, is suboptimal
since it may overfit in terms of styles, thereby losing its original
generalization ability to capture the semantics of various language registers.
The overfitted text encoder, in turn, provides a harmful supervision signal,
degrading the video representation. To tackle this issue, we propose a
degradation-free pre-training strategy to retain the generalization ability of
the text encoder by freezing its shallow layers while allowing task-related
semantics to be captured in the tunable deep layers. As for the training
objective, we adopt the transcript sorting task from TVTS, combined with masking
techniques to enable scalable training. As a result, we produce a series of
models, dubbed TVTSv2, with up to one billion parameters. We achieve new
state-of-the-art results on various video benchmarks with a frozen backbone,
surpassing the recent ImageBind, InternVideo, etc. Code is available at
https://github.com/TencentARC/TVTS.
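For intuition, the partial-freezing recipe described above can be sketched in a few lines of PyTorch. The snippet below is an illustrative sketch only, not the official TVTSv2 implementation: the toy encoder, the attribute names (embeddings, layers), and the 9-of-12 frozen-layer split are assumptions made for demonstration; refer to the repository linked above for the actual code.

import torch.nn as nn

def partially_freeze_text_encoder(text_encoder: nn.Module,
                                  num_frozen_layers: int) -> None:
    """Freeze the token embeddings and the first `num_frozen_layers` transformer
    blocks of a pre-trained text encoder; leave the deep blocks tunable."""
    # Freeze the input embeddings so low-level lexical knowledge is preserved.
    for p in text_encoder.embeddings.parameters():
        p.requires_grad = False
    # Freeze shallow transformer blocks; keep deep blocks trainable so they can
    # adapt to task-related (video-language) semantics.
    for i, block in enumerate(text_encoder.layers):
        tune_block = i >= num_frozen_layers
        for p in block.parameters():
            p.requires_grad = tune_block

# Toy text encoder with the attribute layout assumed above (hypothetical).
class ToyTextEncoder(nn.Module):
    def __init__(self, vocab_size=49408, dim=512, depth=12):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

if __name__ == "__main__":
    encoder = ToyTextEncoder()
    partially_freeze_text_encoder(encoder, num_frozen_layers=9)  # freeze 9 of 12
    trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
    total = sum(p.numel() for p in encoder.parameters())
    print(f"trainable parameters: {trainable}/{total}")

The intuition is that the frozen shallow layers keep the general language semantics learned during text pre-training, while the tunable deep layers align text features with the video-side supervision.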
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-Guided Masked Video Modeling (SIGMA) is a novel video pre-training method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which may not align well with human preferences.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Curriculum Learning for Recurrent Video Object Segmentation [2.3376061255029064]
This work explores different scheduled sampling and frame skipping variations to significantly improve the performance of a recurrent architecture.
Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse scheduled sampling is a better option than a classic forward one.
arXiv Detail & Related papers (2020-08-15T10:51:22Z)