Large Model Empowered Metaverse: State-of-the-Art, Challenges and Opportunities
- URL: http://arxiv.org/abs/2502.10397v1
- Date: Sat, 18 Jan 2025 13:52:48 GMT
- Title: Large Model Empowered Metaverse: State-of-the-Art, Challenges and Opportunities
- Authors: Yuntao Wang, Qinnan Hu, Zhou Su, Linkang Du, Qichao Xu,
- Abstract summary: The Metaverse is an immersive, persistent digital ecosystem where users can interact, socialize, and work within 3D virtual environments.<n>This paper investigates the integration of large models within the Metaverse.<n>We propose a generative AI-based framework for optimizing Metaverse rendering.
- Score: 23.465545107612595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Metaverse represents a transformative shift beyond traditional mobile Internet, creating an immersive, persistent digital ecosystem where users can interact, socialize, and work within 3D virtual environments. Powered by large models such as ChatGPT and Sora, the Metaverse benefits from precise large-scale real-world modeling, automated multimodal content generation, realistic avatars, and seamless natural language understanding, which enhance user engagement and enable more personalized, intuitive interactions. However, challenges remain, including limited scalability, constrained responsiveness, and low adaptability in dynamic environments. This paper investigates the integration of large models within the Metaverse, examining their roles in enhancing user interaction, perception, content creation, and service quality. To address existing challenges, we propose a generative AI-based framework for optimizing Metaverse rendering. This framework includes a cloud-edge-end collaborative model to allocate rendering tasks with minimal latency, a mobility-aware pre-rendering mechanism that dynamically adjusts to user movement, and a diffusion model-based adaptive rendering strategy to fine-tune visual details. Experimental results demonstrate the effectiveness of our approach in enhancing rendering efficiency and reducing rendering overheads, advancing large model deployment for a more responsive and immersive Metaverse.
Related papers
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.<n>To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.<n> Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.<n>We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z) - Haptic Repurposing with GenAI [5.424247121310253]
Mixed Reality aims to merge the digital and physical worlds to create immersive human-computer interactions.
This paper introduces Haptic Repurposing with GenAI, an innovative approach to enhance MR interactions by transforming any physical objects into adaptive haptic interfaces for AI-generated virtual assets.
arXiv Detail & Related papers (2024-06-11T13:06:28Z) - iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z) - FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba [19.761723108363796]
FusionMamba aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks.
The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms.
Experiments show that FusionMamba achieves state-of-the-art performance in a variety of multimodal image fusion tasks as well as downstream experiments.
arXiv Detail & Related papers (2024-04-15T06:37:21Z) - VR-GS: A Physical Dynamics-Aware Interactive Gaussian Splatting System in Virtual Reality [39.53150683721031]
Our proposed VR-GS system represents a leap forward in human-centered 3D content interaction.
The components of our Virtual Reality system are designed for high efficiency and effectiveness.
arXiv Detail & Related papers (2024-01-30T01:28:36Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object
Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
A benchmark contains over 17,000 action trajectories with six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z) - VIRT: Improving Representation-based Models for Text Matching through
Virtual Interaction [50.986371459817256]
We propose a novel textitVirtual InteRacTion mechanism, termed as VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z) - Generative Adversarial Transformers [13.633811200719627]
We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling.
The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linearly efficiency.
We show it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency.
arXiv Detail & Related papers (2021-03-01T18:54:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.