Semantic Face Compression for Metaverse: A Compact 3D Descriptor Based Approach
- URL: http://arxiv.org/abs/2311.12817v1
- Date: Sun, 24 Sep 2023 13:39:50 GMT
- Title: Semantic Face Compression for Metaverse: A Compact 3D Descriptor Based Approach
- Authors: Binzhe Li, Bolin Chen, Zhao Wang, Shiqi Wang, Yan Ye
- Abstract summary: We envision a new metaverse communication paradigm for virtual avatar faces, and develop the semantic face compression with compact 3D facial descriptors.
The proposed scheme is expected to enable numerous applications, such as digital human communication based on machine analysis.
- Score: 15.838410034900138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this letter, we envision a new metaverse communication paradigm for
virtual avatar faces, and develop the semantic face compression with compact 3D
facial descriptors. The fundamental principle is that the communication of
virtual avatar faces primarily emphasizes the conveyance of semantic
information. In light of this, the proposed scheme offers the advantages of
being highly flexible, efficient and semantically meaningful. The semantic face
compression, which allows the communication of the descriptors for artificial
intelligence based understanding, could facilitate numerous applications
without the involvement of humans in metaverse. The promise of the proposed
paradigm is also demonstrated by performance comparisons with the
state-of-the-art video coding standard, Versatile Video Coding. A significant
improvement in terms of rate-accuracy performance has been achieved. The
proposed scheme is expected to enable numerous applications, such as digital
human communication based on machine analysis, and to form the cornerstone of
interaction and communication in the metaverse.
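As a rough illustration of the paradigm, the sketch below transmits a per-frame avatar face as a small vector of 3D facial descriptors rather than as pixels. The identity/expression/pose split, the dimensionalities, and the uniform quantizer are illustrative assumptions, not the authors' exact coding pipeline.

```python
# Illustrative sketch only: a per-frame avatar face sent as a compact 3D
# descriptor instead of pixel data. The descriptor layout and quantizer are
# assumptions for exposition, not the proposed scheme's actual design.
import numpy as np

STEP = 0.02  # hypothetical quantization step size

def encode(descriptor: np.ndarray) -> np.ndarray:
    """Quantize descriptor coefficients to integer symbols for transmission."""
    return np.round(descriptor / STEP).astype(np.int16)

def decode(symbols: np.ndarray) -> np.ndarray:
    """Reconstruct approximate descriptor values at the receiver."""
    return symbols.astype(np.float32) * STEP

rng = np.random.default_rng(0)
# Hypothetical per-frame descriptor: 80 identity + 64 expression + 6 pose dims.
frame_descriptor = rng.normal(scale=0.3, size=150).astype(np.float32)

symbols = encode(frame_descriptor)
reconstructed = decode(symbols)

payload_bits = symbols.size * 16  # naive fixed-length coding, no entropy coder
print(f"payload: {payload_bits // 8} bytes per frame")
print(f"max coefficient error: {np.abs(reconstructed - frame_descriptor).max():.4f}")
```

Even without an entropy coder, this toy version shows why a descriptor-domain payload can be far smaller than a pixel-domain bitstream, which is the basis of the rate-accuracy comparison against Versatile Video Coding.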
Related papers
- 3D Vision-Language Gaussian Splatting [29.047044145499036]
Multi-modal 3D scene understanding has vital applications in robotics, autonomous driving, and virtual/augmented reality.
We propose a solution that adequately handles the distinct visual and semantic modalities.
We also employ a camera-view blending technique to improve semantic consistency between existing views.
arXiv Detail & Related papers (2024-10-10T03:28:29Z)
- Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation [9.67450435520651]
This paper introduces MetaFace, a novel methodology crafted for speaking style adaptation.
It is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (NDRM) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to enhance the efficiency of model optimization.
arXiv Detail & Related papers (2024-08-18T04:42:43Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Scalable Face Image Coding via StyleGAN Prior: Towards Compression for Human-Machine Collaborative Vision [39.50768518548343]
We investigate how hierarchical representations derived from the advanced generative prior facilitate constructing an efficient scalable coding paradigm for human-machine collaborative vision.
Our key insight is that by exploiting the StyleGAN prior, we can learn three-layered representations encoding hierarchical semantics, which are elaborately designed into the basic, middle, and enhanced layers.
Based on the multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized to achieve optimal machine analysis performance, human perception experience, and compression ratio.
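A minimal sketch of what such a multi-task scalable rate-distortion objective could look like follows; the specific loss terms and weights are assumptions, not the paper's formulation.

```python
# Hedged sketch of a multi-task scalable rate-distortion objective:
# total loss = rate + weighted machine-analysis distortion + weighted
# human-perception distortion. Terms and weights are illustrative only.
def scalable_rd_objective(rate_bits: float,
                          machine_task_loss: float,
                          perceptual_loss: float,
                          lambda_machine: float = 1.0,
                          lambda_human: float = 0.1) -> float:
    """Combine bitrate and both distortion terms into one training loss."""
    return rate_bits + lambda_machine * machine_task_loss + lambda_human * perceptual_loss

# Example: a basic layer tuned mainly for machine analysis, with enhancement
# layers adding perceptual quality at extra rate.
print(scalable_rd_objective(rate_bits=0.05, machine_task_loss=0.8, perceptual_loss=2.3))
```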
arXiv Detail & Related papers (2023-12-25T05:57:23Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
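To illustrate the general idea of parameterizing an implicit representation with interpretable 3D face model coefficients, here is a toy conditioned implicit function; the network shape and conditioning scheme are assumptions, not the paper's architecture.

```python
# Toy illustration: an implicit function f(x, p) -> (rgb, density) conditioned
# on interpretable 3D face model parameters p (e.g., expression/pose).
# Architecture and conditioning are assumptions, not the paper's design.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3 + 64, 128))   # input: 3D point + 64 face params
W2 = rng.normal(scale=0.1, size=(128, 4))        # output: RGB + density

def implicit_face(point_xyz: np.ndarray, face_params: np.ndarray) -> np.ndarray:
    """Query the implicit field at a 3D point, conditioned on face parameters."""
    h = np.maximum(np.concatenate([point_xyz, face_params]) @ W1, 0.0)  # one ReLU MLP layer
    return h @ W2  # first three values: color, last value: density

sample = implicit_face(np.array([0.1, -0.2, 0.5]), rng.normal(size=64))
print("rgb:", sample[:3], "density:", sample[3])
```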
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Interactive Face Video Coding: A Generative Compression Framework [18.26476468644723]
We propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals.
The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression and headpose animation.
arXiv Detail & Related papers (2023-02-20T11:24:23Z)
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions that mimic the behaviors of interaction-based models.
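A rough sketch of the virtual-interaction idea: two sentences are encoded independently, a cross-attention map is computed post hoc from those independent encodings, and it is aligned with the attention of an interaction-based teacher. The shapes, the softmax attention, and the MSE alignment loss below are assumptions, not VIRT's exact objective.

```python
# Rough sketch of virtual interaction: align a cross-attention map computed
# from independently encoded sentences with that of an interaction-based
# teacher. Shapes, attention form, and the MSE loss are assumptions.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Token-to-token attention between two independently encoded sequences."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
sent_a = rng.normal(size=(7, 32))    # tokens of sentence A, encoded independently
sent_b = rng.normal(size=(9, 32))    # tokens of sentence B, encoded independently
teacher_attn = softmax(rng.normal(size=(7, 9)))  # stand-in for an interaction-based teacher

student_attn = cross_attention(sent_a, sent_b)   # the "virtual" interaction
alignment_loss = np.mean((student_attn - teacher_attn) ** 2)
print(f"virtual-interaction alignment loss: {alignment_loss:.4f}")
```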
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
- Towards Modality Transferable Visual Information Representation with Optimal Model Compression [67.89885998586995]
We propose a new scheme for visual signal representation that leverages the philosophy of transferable modality.
The proposed framework is implemented on the state-of-the-art video coding standard.
arXiv Detail & Related papers (2020-08-13T01:52:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.