Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
- URL: http://arxiv.org/abs/2412.11409v3
- Date: Wed, 15 Jan 2025 01:59:02 GMT
- Title: Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
- Authors: Rui Liu, Shuwei He, Yifan Hu, Haizhou Li
- Abstract summary: M2SE-VTTS aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. We propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS. Our model outperforms the advanced baselines in environmental speech generation.
- Score: 39.74416731035842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information are crucial for understanding the spatial environment, which previous works have ignored. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal branch takes both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale branch models the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: https://github.com/AI-S2-Lab/M2SE-VTTS.
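As a concrete illustration of the pipeline the abstract outlines, the following is a minimal PyTorch sketch, not the released M2SE-VTTS code: it patch-embeds the RGB and depth images, lets a caption embedding attend over the local patches, and fuses the result with a crude global descriptor. All module choices, names, and dimensions are assumptions made for illustration only.

```python
# Illustrative sketch (not the authors' code) of the two ideas in the abstract:
# (1) patch-level features from both RGB and depth, (2) a caption embedding that
# attends over local patches before fusion with a global token.
import torch
import torch.nn as nn

class LocalGlobalSpatialEncoder(nn.Module):
    def __init__(self, dim=256, patch=16):
        super().__init__()
        # patch embeddings for the two modalities (RGB: 3 channels, depth: 1)
        self.rgb_patch = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_patch = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # caption-guided local attention: the caption embedding queries the patches
        self.local_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # local-aware global fusion over [global token | local summary]
        self.global_fuse = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, rgb, depth, caption_emb):
        # rgb: (B, 3, H, W), depth: (B, 1, H, W), caption_emb: (B, dim)
        patches = torch.cat([
            self.rgb_patch(rgb).flatten(2).transpose(1, 2),
            self.depth_patch(depth).flatten(2).transpose(1, 2),
        ], dim=1)                                        # (B, N_rgb + N_depth, dim)
        local, _ = self.local_attn(caption_emb.unsqueeze(1), patches, patches)
        global_tok = patches.mean(dim=1, keepdim=True)   # crude global descriptor
        fused = self.global_fuse(torch.cat([global_tok, local], dim=1))
        return fused.mean(dim=1)                         # spatial prompt for a TTS decoder

enc = LocalGlobalSpatialEncoder()
out = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 256])
```

In a full system the fused vector would condition the acoustic model so that the synthesized speech carries the room's reverberation; here it is simply returned to keep the sketch self-contained.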
Related papers
- EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery [15.581788175591097]
It is challenging to adapt spatial models built for natural images to remote sensing imagery.
EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities.
Experiments demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks.
arXiv Detail & Related papers (2025-04-17T09:56:35Z)
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding [29.42797944919497]
We propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding.
In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the coordinate special token into the visual space.
We also design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams.
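A rough sketch of the decoupling idea behind the Spatial-Temporal Packer as summarized above: spatial and temporal resolution are compressed in two separate attention streams, each using a small set of learned queries that pool a larger set of tokens. This is an assumption-based illustration, not the LLaVA-ST implementation; all names and sizes are invented.

```python
# Toy "decoupled" spatial-temporal compression: first pool patches per frame,
# then pool frames per kept spatial slot, each with its own attention stream.
import torch
import torch.nn as nn

class DecoupledSTPacker(nn.Module):
    def __init__(self, dim=512, n_space=16, n_time=8):
        super().__init__()
        self.space_q = nn.Parameter(torch.randn(n_space, dim))  # spatial queries
        self.time_q = nn.Parameter(torch.randn(n_time, dim))    # temporal queries
        self.space_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, feats):
        # feats: (B, T, N, dim) video tokens (T frames, N patches per frame)
        B, T, N, D = feats.shape
        # spatial stream: compress the N patches of each frame to n_space tokens
        per_frame = feats.reshape(B * T, N, D)
        sq = self.space_q.unsqueeze(0).expand(B * T, -1, -1)
        spatial, _ = self.space_attn(sq, per_frame, per_frame)      # (B*T, n_space, D)
        spatial = spatial.reshape(B, T, -1, D)
        # temporal stream: compress the T frames of each spatial slot to n_time tokens
        per_slot = spatial.permute(0, 2, 1, 3).reshape(-1, T, D)    # (B*n_space, T, D)
        tq = self.time_q.unsqueeze(0).expand(per_slot.shape[0], -1, -1)
        packed, _ = self.time_attn(tq, per_slot, per_slot)          # (B*n_space, n_time, D)
        return packed.reshape(B, -1, D)                             # (B, n_space*n_time, D)

packer = DecoupledSTPacker()
tokens = packer(torch.randn(2, 32, 196, 512))
print(tokens.shape)  # torch.Size([2, 128, 512])
```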
arXiv Detail & Related papers (2025-01-14T17:58:12Z)
- Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation [15.302043040651368]
Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). We propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture to facilitate navigation. We show that SUSA's hybrid semantic-spatial representations effectively enhance navigation performance, setting new state-of-the-art results across three VLN benchmarks (REVERIE, R2R, and SOON).
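As a toy illustration of the hybrid semantic-spatial idea, and not the SUSA architecture itself, the sketch below scores candidate viewpoints by combining an instruction-conditioned semantic descriptor with a depth/geometry descriptor; the feature sources and dimensions are hypothetical.

```python
# Toy hybrid scorer: combine a "semantic" view descriptor (text-aligned image
# features) with a "spatial" descriptor (depth/geometry statistics) per candidate.
import torch
import torch.nn as nn

class HybridViewScorer(nn.Module):
    def __init__(self, sem_dim=512, spa_dim=64, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sem_dim + spa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, semantic_feats, spatial_feats, instruction_emb):
        # semantic_feats: (B, K, sem_dim) per-candidate text-aligned features
        # spatial_feats:  (B, K, spa_dim) per-candidate depth/geometry features
        # instruction_emb: (B, sem_dim) encoded instruction
        sem = semantic_feats * instruction_emb.unsqueeze(1)   # instruction-conditioned semantics
        hybrid = torch.cat([sem, spatial_feats], dim=-1)
        return self.fuse(hybrid).squeeze(-1)                  # (B, K) candidate scores

scorer = HybridViewScorer()
scores = scorer(torch.randn(2, 6, 512), torch.randn(2, 6, 64), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 6])
```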
arXiv Detail & Related papers (2024-12-09T13:10:28Z)
- Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech [39.206005299985605]
Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to synthesize reverberant speech for the spoken content.
Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics.
We propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS.
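The general recipe suggested by the summary, pooling several knowledge sources into one conditioning vector, can be sketched as follows; this is not the MS$^2$KU-VTTS implementation, and the particular sources and sizes are assumptions.

```python
# Hedged sketch: aggregate several knowledge sources (an RGB feature, a depth
# feature, a speaker-position encoding, and a semantic/caption embedding) into
# a single conditioning vector via attention over projected source tokens.
import torch
import torch.nn as nn

class MultiSourceAggregator(nn.Module):
    def __init__(self, dim=256, n_sources=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.LazyLinear(dim) for _ in range(n_sources)])
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, sources):
        # sources: list of (B, any_dim) tensors, one per knowledge source
        toks = torch.stack([p(s) for p, s in zip(self.proj, sources)], dim=1)  # (B, n, dim)
        q = self.query.expand(toks.shape[0], -1, -1)
        fused, weights = self.attn(q, toks, toks)     # weights show each source's contribution
        return fused.squeeze(1), weights.squeeze(1)   # (B, dim), (B, n)

agg = MultiSourceAggregator()
cond, w = agg([torch.randn(2, 768),   # RGB feature
               torch.randn(2, 128),   # depth feature
               torch.randn(2, 3),     # speaker position (x, y, z)
               torch.randn(2, 512)])  # environmental semantics / caption
print(cond.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 4])
```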
arXiv Detail & Related papers (2024-10-18T00:46:18Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
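For reference, the slot-attention mechanism that GeoVLN builds on (Locatello et al., 2020) can be sketched compactly as below; this is the generic algorithm, not GeoVLN's geometry-enhanced variant, and the hyperparameters are illustrative.

```python
# Minimal vanilla slot attention: slots iteratively compete for input features
# via a per-slot softmax and are refined with a GRU update.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=6, dim=128, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (B, N, dim) visual features; returns (B, num_slots, dim)
        B, N, D = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(torch.einsum('bnd,bsd->bns', k, q) * self.scale, dim=-1)
            attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)  # weighted mean per slot
            updates = torch.einsum('bns,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).reshape(B, -1, D)
        return slots

sa = SlotAttention()
print(sa(torch.randn(2, 196, 128)).shape)  # torch.Size([2, 6, 128])
```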
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark [62.34200575624785]
We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG).
SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack).
We evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LMs using SILG.
arXiv Detail & Related papers (2021-10-20T17:02:06Z)
- Low Light Image Enhancement via Global and Local Context Modeling [164.85287246243956]
We introduce a context-aware deep network for low-light image enhancement.
First, it features a global context module that models spatial correlations to find complementary cues over the full spatial domain.
Second, it introduces a dense residual block that captures local context with a relatively large receptive field.
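The two ingredients named in the summary, a global context module and a dense residual block, follow well-known patterns that can be sketched roughly as below; this shows the general shape of such modules, not the paper's exact design.

```python
# Generic global-context block (one attention map pooled over all positions)
# and dense residual block (densely connected 3x3 convs with a residual output).
import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.mask = nn.Conv2d(ch, 1, 1)                 # attention map over all positions
        self.transform = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(), nn.Conv2d(ch, ch, 1))

    def forward(self, x):
        B, C, H, W = x.shape
        w = torch.softmax(self.mask(x).view(B, 1, H * W), dim=-1)            # (B, 1, HW)
        ctx = torch.bmm(x.view(B, C, H * W), w.transpose(1, 2)).view(B, C, 1, 1)
        return x + self.transform(ctx)                  # broadcast the global cue to every pixel

class DenseResidualBlock(nn.Module):
    def __init__(self, ch, growth=16, layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(layers))
        self.fuse = nn.Conv2d(ch + layers * growth, ch, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.layers:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))          # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))                        # residual output

x = torch.randn(1, 32, 64, 64)
print(DenseResidualBlock(32)(GlobalContext(32)(x)).shape)  # torch.Size([1, 32, 64, 64])
```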
arXiv Detail & Related papers (2021-01-04T09:40:54Z)
- NeuralFusion: Online Depth Fusion in Latent Space [77.59420353185355]
We present a novel online depth map fusion approach that learns depth map aggregation in a latent feature space.
Our approach is real-time capable, handles high noise levels, and is particularly able to deal with gross outliers common for photometric stereo-based depth maps.
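A much-simplified sketch of the online-update structure the summary describes: keep a persistent latent state, encode each incoming depth map, integrate it with a learned update, and decode on demand. The real method fuses in a 3D latent volume; this toy operates on a 2D grid purely to show the loop, and every module here is an assumption.

```python
# Toy online latent fusion: encode each noisy depth map, merge it into a
# persistent latent grid with a learned update, decode the grid at the end.
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    def __init__(self, latent=16):
        super().__init__()
        self.latent = latent
        self.encode = nn.Conv2d(1, latent, 3, padding=1)           # per-frame depth encoder
        self.update = nn.Conv2d(2 * latent, latent, 3, padding=1)  # learned integration step
        self.decode = nn.Conv2d(latent, 1, 3, padding=1)           # latent grid -> fused map

    def integrate(self, state, depth):
        obs = torch.relu(self.encode(depth))
        return torch.tanh(self.update(torch.cat([state, obs], dim=1)))  # new latent state

    def forward(self, depth_stream):
        b, _, h, w = depth_stream[0].shape
        state = torch.zeros(b, self.latent, h, w)                   # persistent latent grid
        for depth in depth_stream:                                  # online, one frame at a time
            state = self.integrate(state, depth)
        return self.decode(state)

fusion = LatentFusion()
frames = [torch.rand(1, 1, 64, 64) for _ in range(5)]
print(fusion(frames).shape)  # torch.Size([1, 1, 64, 64])
```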
arXiv Detail & Related papers (2020-11-30T13:50:59Z)
- A Multi-Level Approach to Waste Object Segmentation [10.20384144853726]
We address the problem of localizing waste objects from a color image and an optional depth image.
Our method integrates the intensity and depth information at multiple levels of spatial granularity.
We create a new RGBD waste object segmentation dataset, MJU-Waste, which is made public to facilitate future research in this area.
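A generic sketch, not the paper's model, of fusing intensity and depth features at several spatial scales and merging them into a single segmentation mask; the backbone, scales, and channel sizes are invented for illustration.

```python
# Toy multi-level RGB-D segmentation: two small pyramids (color and depth),
# each level upsampled back to full resolution and fused into one mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelRGBDSeg(nn.Module):
    def __init__(self, ch=32, levels=3):
        super().__init__()
        self.rgb = nn.ModuleList(nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1)
                                 for i in range(levels))
        self.dep = nn.ModuleList(nn.Conv2d(1 if i == 0 else ch, ch, 3, stride=2, padding=1)
                                 for i in range(levels))
        self.head = nn.Conv2d(levels * 2 * ch, 1, 1)

    def forward(self, rgb, depth):
        feats, H, W = [], rgb.shape[-2], rgb.shape[-1]
        r, d = rgb, depth
        for conv_r, conv_d in zip(self.rgb, self.dep):
            r, d = torch.relu(conv_r(r)), torch.relu(conv_d(d))     # coarser at every level
            for f in (r, d):                                        # upsample each level back
                feats.append(F.interpolate(f, size=(H, W), mode='bilinear', align_corners=False))
        return torch.sigmoid(self.head(torch.cat(feats, dim=1)))    # (B, 1, H, W) mask

model = MultiLevelRGBDSeg()
mask = model(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
print(mask.shape)  # torch.Size([1, 1, 128, 128])
```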
arXiv Detail & Related papers (2020-07-08T16:49:25Z)