InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
- URL: http://arxiv.org/abs/2512.01342v1
- Date: Mon, 01 Dec 2025 06:57:39 GMT
- Title: InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
- Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang
- Abstract summary: Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage. Masked video modeling (MVM) directly exploits spatiotemporal structure but trails text-supervised methods on general tasks. We propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for a latent world model.
- Score: 29.40602634269908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latents to be linearly projectable to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
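The sketch below illustrates one way to read the Encoder-Predictor-Decoder scheme described in the abstract, focusing on Stage 2: a predictor acting as a latent world model fills in masked video tokens and is supervised by latents from a frozen Stage 1 encoder rather than by pixels. The module sizes, masking scheme, and regression loss are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical Stage-2 sketch of the EPD idea: encode visible tokens, let a
# predictor (latent world model) infer the masked ones, and regress them toward
# frozen Stage-1 targets. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EPDStage2(nn.Module):
    def __init__(self, dim: int = 768, enc_depth: int = 12, pred_depth: int = 6, heads: int = 12):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        pred_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)      # online video encoder
        self.predictor = nn.TransformerEncoder(pred_layer, pred_depth)  # latent world model
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor, frozen_targets: torch.Tensor):
        # tokens:         (B, N, dim) patchified spatiotemporal video tokens
        # mask:           (B, N) bool, True where a token is hidden from the encoder
        # frozen_targets: (B, N, dim) latents produced by the frozen Stage-1 encoder
        B, N, D = tokens.shape
        # Zeroing masked positions is a simplification of dropping them from the encoder input.
        visible = self.encoder(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
        # Re-insert a learnable mask token at hidden positions; the predictor infers
        # their latents from the surrounding visible context.
        filled = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), visible)
        pred = self.predictor(filled)
        # Regress only the masked positions toward the frozen Stage-1 targets; predicting
        # in this fixed latent space is what the abstract argues curbs shortcut learning.
        return F.smooth_l1_loss(pred[mask], frozen_targets[mask])
```

In this reading, Stage 1 would replace the linear pixel decoder with a conditional diffusion decoder (injected with image-level semantic priors) to shape the latent space, and Stage 2 freezes that encoder to supply the `frozen_targets` above.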
Related papers
- SemanticGen: Video Generation in Semantic Space [60.49729308406981]
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. We introduce SemanticGen, a novel solution to generate videos in the semantic space. Our method is also effective and computationally efficient when extended to long video generation.
arXiv Detail & Related papers (2025-12-23T18:59:56Z) - Generative Video Matting [57.186684844156595]
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations. We introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models.
arXiv Detail & Related papers (2025-08-11T12:18:55Z) - FRAME: Pre-Training Video Feature Representations via Anticipation and Memory [55.046881477209695]
FRAME is a self-supervised video frame encoder tailored for dense video understanding. It learns to predict current and future DINO patch features from past and present RGB frames. It consistently outperforms image encoders and existing self-supervised video models.
arXiv Detail & Related papers (2025-06-05T19:44:47Z) - Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z) - Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
arXiv Detail & Related papers (2025-04-07T08:47:36Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction, and spatiotemporal reasoning. We propose ReferDINO, an RVOS approach that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z) - Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection [14.721615285883423]
We propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos.
This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability.
arXiv Detail & Related papers (2023-03-23T05:53:34Z) - Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-arts with clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.