Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
- URL: http://arxiv.org/abs/2409.11718v2
- Date: Sun, 22 Sep 2024 08:23:33 GMT
- Title: Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
- Authors: Yuan Tian, Guo Lu, Guangtao Zhai
- Abstract summary: Unsupervised video semantic compression (UVSC) has recently garnered attention.
We propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs.
We introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs.
- Score: 54.62883091552163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs. Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
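The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the additive prompts, the cosine-distance loss, the linear extrapolation used as a stand-in for trajectory traversal, and all names (`align`, `alignment_loss`, `predict_next_semantics`, `vfm_a`, `vfm_b`) are assumptions made purely for illustration, with random arrays in place of real codec and VFM features.

```python
import numpy as np

rng = np.random.default_rng(0)

def align(features, shared_w, prompt):
    """Shift features by a per-VFM prompt, then apply the shared projection.
    (Additive prompts are an assumption; the abstract does not specify the form.)"""
    return (features + prompt) @ shared_w

def cosine_distance(a, b):
    """Mean (1 - cosine similarity) over tokens; lies in [0, 2]."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=-1).mean()

def alignment_loss(codec_features, shared_w, prompts, vfm_targets):
    """Average alignment loss over all VFMs, sharing one projection matrix."""
    losses = [
        cosine_distance(align(codec_features, shared_w, prompts[name]),
                        vfm_targets[name])
        for name in prompts
    ]
    return float(np.mean(losses))

def predict_next_semantics(history):
    """Toy trajectory step: extrapolate linearly from the last two frames.
    history has shape (num_frames, dim) with num_frames >= 2."""
    return history[-1] + (history[-1] - history[-2])

# Hypothetical sizes and random stand-in data.
dim, num_tokens = 16, 8
shared_w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
prompts = {name: 0.1 * rng.standard_normal(dim) for name in ("vfm_a", "vfm_b")}
codec_features = rng.standard_normal((num_tokens, dim))
vfm_targets = {name: rng.standard_normal((num_tokens, dim)) for name in prompts}

loss = alignment_loss(codec_features, shared_w, prompts, vfm_targets)
context = predict_next_semantics(np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]))
```

In this sketch the projection matrix is shared across all VFMs and only the small prompt vectors differ per model, which mirrors the "shared layer plus VFM-specific prompts" split described above. The predicted `context` would serve as the coding context for the next frame.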
Related papers
- SMC++: Masked Learning of Unsupervised Video Semantic Compression [54.62883091552163]
We propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics.
MVM is proficient at learning generalizable semantics through the masked patch prediction task.
However, MVM may also encode non-semantic information such as trivial textural details, which wastes bitcost and introduces semantic noise.
(2024-06-07T09:06:40Z)
- Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
(2024-05-27T08:39:38Z)
- Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection [14.721615285883423]
We propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos.
This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability.
(2023-03-23T05:53:34Z)
- Cross Modal Compression: Towards Human-comprehensible Semantic Compression [73.89616626853913]
Cross modal compression is a semantic compression framework for visual data.
We show that our proposed CMC can achieve encouraging reconstructed results with an ultrahigh compression ratio.
(2022-09-06T15:31:11Z)
- Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
(2022-08-08T15:39:54Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
(2022-02-06T16:29:15Z)
- Thousand to One: Semantic Prior Modeling for Conceptual Coding [26.41657489930382]
We propose an end-to-end semantic prior-based conceptual coding scheme towards extremely low bitrate image compression.
We employ semantic segmentation maps as structural guidance for extracting deep semantic prior.
A cross-channel entropy model is proposed to further exploit the inter-channel correlation of the spatially independent semantic prior.
(2021-03-12T08:02:07Z)