Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos
- URL: http://arxiv.org/abs/2008.01652v1
- Date: Sun, 2 Aug 2020 04:38:59 GMT
- Title: Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos
- Authors: Yanhui Guo, Xi Zhang, Xiaolin Wu
- Abstract summary: We propose a novel deep multi-modality neural network for restoring very low bit rate videos of talking heads.
The proposed CNN method exploits the correlations among three modalities, video, audio and emotion state of the speaker, to remove the video compression artifacts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel deep multi-modality neural network for restoring very low
bit rate videos of talking heads. Such video contents are very common in social
media, teleconferencing, distance education, tele-medicine, etc., and often
need to be transmitted with limited bandwidth. The proposed CNN method exploits
the correlations among three modalities, video, audio and emotion state of the
speaker, to remove the video compression artifacts caused by spatial
down-sampling and quantization. The deep learning approach turns out to be ideally
suited for the video restoration task, as the complex non-linear cross-modality
correlations are very difficult to model analytically and explicitly. The new
method is a video post-processor that can significantly boost the perceptual
quality of aggressively compressed talking head videos, while being fully
compatible with all existing video compression standards.
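The degradation this post-processor targets can be sketched in a few lines. The snippet below is an illustrative stand-in, not the paper's codec or network: it mimics aggressive spatial down-sampling and coarse quantization of a frame, which is the kind of input a restoration CNN would receive after any standard decoder.

```python
import numpy as np

def degrade(frame, scale=4, qstep=32):
    """Illustrative stand-in for very low bit-rate coding: block-average
    down-sampling followed by coarse uniform quantization, then naive
    nearest-neighbour upsampling back to full resolution. A restoration
    network would be applied to the output of a real decoder instead."""
    h, w = frame.shape
    h, w = h - h % scale, w - w % scale
    # Down-sample by averaging scale x scale blocks.
    small = frame[:h, :w].reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # Coarse uniform quantization of the down-sampled frame.
    quantized = np.round(small / qstep) * qstep
    # Naive nearest-neighbour upsampling back to full resolution.
    return np.repeat(np.repeat(quantized, scale, axis=0), scale, axis=1)

frame = np.linspace(0, 255, 64 * 64).reshape(64, 64)  # synthetic luma frame
decoded = degrade(frame)
print(decoded.shape)  # (64, 64)
```

Because the restoration runs purely on the decoder side, after the bitstream has been decoded, it stays compatible with any existing video compression standard.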
Related papers
- One-Click Upgrade from 2D to 3D: Sandwiched RGB-D Video Compression for Stereoscopic Teleconferencing
We propose a new approach to upgrade a 2D video to support stereo RGB-D video compression, by wrapping it with a neural pre- and post-processor pair.
We train the neural pre- and post-processors on a synthetic 4D people dataset, and evaluate it on both synthetic and real-captured stereo RGB-D videos.
Our approach saves about 30% bit-rate compared to a conventional video coding scheme and MV-HEVC at the same level of rendering quality from a novel view.
arXiv Detail & Related papers (2024-04-15T17:56:05Z) - Resolution-Agnostic Neural Compression for High-Fidelity Portrait Video Conferencing via Implicit Radiance Fields
High fidelity and low bandwidth are two major objectives of video compression for video conferencing applications.
We propose a novel low bandwidth neural compression approach for high-fidelity portrait video conferencing.
arXiv Detail & Related papers (2024-02-26T14:29:13Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - Progressive Fourier Neural Representation for Sequential Video Compilation
Motivated by continual learning, this work investigates how to accumulate and transfer neural implicit representations for multiple complex video data over sequential encoding sessions.
We propose a novel method, Progressive Fourier Neural Representation (PFNR), that aims to find an adaptive and compact sub-module in Fourier space to encode videos in each training session.
We validate our PFNR method on the UVG8/17 and DAVIS50 video sequence benchmarks and achieve impressive performance gains over strong continual learning baselines.
arXiv Detail & Related papers (2023-06-20T06:02:19Z) - Towards Scalable Neural Representation for Diverse Videos
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images.
Existing INR-based methods are limited to encoding a handful of short videos with redundant visual content.
This paper focuses on developing neural representations for encoding long and/or a large number of videos with diverse visual content.
arXiv Detail & Related papers (2023-03-24T16:32:19Z) - Gemino: Practical and Robust Neural Compression for Video Conferencing
Gemino is a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline.
We show that Gemino operates on videos in real-time on a Titan X GPU, and achieves 2.2-5x lower bit-rate than traditional video codecs for the same perceptual quality.
arXiv Detail & Related papers (2022-09-21T17:10:46Z) - Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z) - Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement
We develop a deep learning architecture capable of restoring detail to compressed videos.
We condition our model on quantization data which is readily available in the bitstream.
We show that this improves restoration accuracy compared to prior compression correction methods.
arXiv Detail & Related papers (2022-01-31T18:56:04Z) - Multi-modality Deep Restoration of Extremely Compressed Face Videos
We develop a multi-modality deep convolutional neural network method for restoring face videos that are aggressively compressed.
The main innovation is a new DCNN architecture that incorporates known priors of multiple modalities.
Ample empirical evidence is presented to validate the superior performance of the proposed DCNN method on face videos.
arXiv Detail & Related papers (2021-07-05T16:29:02Z) - Ultra-low bitrate video conferencing using deep image animation
We propose a novel deep learning approach for ultra-low bitrate video compression for video conferencing applications.
We employ deep neural networks to encode motion information as keypoint displacement and reconstruct the video signal at the decoder side.
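A back-of-the-envelope comparison shows why keypoint-displacement coding can reach ultra-low bit rates. All sizes below are illustrative assumptions, not figures from the paper: it compares the cost of sending a handful of 2-D keypoint displacements per frame against even a heavily compressed pixel payload.

```python
def pixel_payload_bits(width, height, bpp=12, compression=100):
    """Hypothetical per-frame cost of coding pixels directly:
    raw bits divided by an assumed 100x codec compression ratio."""
    return width * height * bpp // compression

def keypoint_payload_bits(num_keypoints, bits_per_coord=16):
    """Per-frame cost of sending only (dx, dy) displacements for each
    keypoint; the decoder animates a single reference frame from these."""
    return num_keypoints * 2 * bits_per_coord

pixels = pixel_payload_bits(640, 360)  # assumed 640x360 frame
keypoints = keypoint_payload_bits(10)  # assumed 10 facial keypoints
print(pixels, keypoints)  # keypoints cost roughly two orders of magnitude less
```

The exact numbers depend entirely on the assumed frame size, keypoint count, and codec efficiency; the point is only that a few dozen coordinates per frame are far cheaper than any pixel payload.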
arXiv Detail & Related papers (2020-12-01T09:06:34Z) - Content Adaptive and Error Propagation Aware Deep Video Compression
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.