Related papers: When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

URL: http://arxiv.org/abs/2408.08093v1
Date: Thu, 15 Aug 2024 11:36:18 GMT
Title: When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
Authors: Pingping Zhang, Jinlong Li, Meng Wang, Nicu Sebe, Sam Kwong, Shiqi Wang,
Abstract summary: Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
Score: 112.44822009714461
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

Related papers

RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation [19.127189099122244]
We introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single step.<n>We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states.
arXiv Detail & Related papers (2025-11-06T12:42:03Z)
HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder [52.698595889988766]
We present a novel perspective on learning video embedders for generative modeling. Rather than requiring an exact reproduction of an input video, an effective embedder should focus on visually plausible reconstructions. We propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework.
arXiv Detail & Related papers (2025-03-11T17:51:07Z)
M3-CVC: Controllable Video Compression with Multimodal Generative Models [17.49397141459785]
M3-CVC is a controllable video compression framework incorporating generative models. We show that M3-CVC significantly outperforms the state-the-art VVC standard in ultralow scenarios.
arXiv Detail & Related papers (2024-11-24T11:56:59Z)
Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization [13.341123726068652]
We propose a novel Multi-granularity Temporal Trajectory Factorization framework for generative human video compression. Experimental results show that proposed method outperforms latest generative models and the state-of-the-art video coding standard Versatile Video Coding.
arXiv Detail & Related papers (2024-10-14T05:34:32Z)
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-SynVideo-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens. DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer. It can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
Video LMMs rely on either image or video encoders to process visual inputs, each of which has its own limitations. We introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling) Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering.
arXiv Detail & Related papers (2024-06-13T17:59:59Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
Implicit-explicit Integrated Representations for Multi-view Video Compression [40.86402535896703]
We propose an implicit-explicit integrated representation for multi-view video compression. The proposed framework combines the strengths of both implicit neural representation and explicit 2D datasets. Our proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV.
arXiv Detail & Related papers (2023-11-29T04:15:57Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding. We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern. By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.