Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
- URL: http://arxiv.org/abs/2410.14101v1
- Date: Fri, 18 Oct 2024 00:46:18 GMT
- Title: Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
- Authors: Shuwei He, Rui Liu, Haizhou Li
- Abstract summary: Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to synthesize the reverberation speech for the spoken content.
Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics.
We propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS.
- Score: 39.206005299985605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to synthesize the reverberation speech for the spoken content. Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address the issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS. Specifically, we first prioritize RGB image as the dominant source and consider depth image, speaker position knowledge from object detection, and semantic captions from image understanding LLM as supplementary sources. Afterwards, we propose a serial interaction mechanism to deeply engage with both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on their contributions. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive spatial speech experience. Experimental results demonstrate that the MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/MS2KU-VTTS/MS2KU-VTTS.
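The abstract describes a dominant RGB source that interacts serially with supplementary sources, followed by contribution-based integration. The sketch below is one possible reading of that idea, not the authors' implementation: the function names (`cross_attend`, `fuse_sources`), the single-head dot-product attention, the residual update, and the softmax contribution score are all assumptions made for illustration, and the random vectors stand in for real encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attend(query, source):
    """Single-head scaled dot-product attention.

    query:  (d,)   current representation of the dominant source
    source: (n, d) token features of one supplementary source
    """
    scores = source @ query / np.sqrt(query.shape[0])  # (n,)
    weights = softmax(scores)                          # (n,)
    return weights @ source                            # (d,)


def fuse_sources(rgb, supplementary):
    """Hypothetical serial interaction plus contribution-weighted fusion.

    rgb:           (d,) dominant RGB feature
    supplementary: dict mapping source name -> (n_i, d) features
    """
    state = rgb
    interacted = {}
    # Serial interaction (assumed form): each source is queried with a state
    # that already carries information from the sources processed before it.
    for name, feats in supplementary.items():
        attended = cross_attend(state, feats)
        state = state + attended  # residual update
        interacted[name] = attended

    names = list(interacted)
    stacked = np.stack([interacted[n] for n in names])  # (k, d)
    # Contribution score (assumed form): similarity to the dominant feature,
    # normalized so the weights over sources sum to 1.
    contrib = softmax(stacked @ rgb / np.sqrt(rgb.shape[0]))  # (k,)
    fused = rgb + contrib @ stacked                           # (d,)
    return fused, dict(zip(names, contrib))


# Placeholder features; real inputs would come from image, depth,
# object-detection, and caption encoders.
rng = np.random.default_rng(0)
d = 8
rgb = rng.normal(size=d)
sources = {
    "depth": rng.normal(size=(4, d)),
    "speaker_position": rng.normal(size=(1, d)),
    "caption": rng.normal(size=(6, d)),
}
fused, weights = fuse_sources(rgb, sources)
```

Here `fused` would be the conditioning vector handed to a speech generator, and `weights` exposes how much each supplementary source contributed, which is a convenient quantity to inspect when debugging a fusion module.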
Related papers
- Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework.
CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks.
Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z) - Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech [39.74416731035842]
M2SE-VTTS aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content.
We propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS.
Our model outperforms the advanced baselines in environmental speech generation.
arXiv Detail & Related papers (2024-12-16T03:25:23Z) - Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z) - Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment.
We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation.
We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream.
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
Multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
arXiv Detail & Related papers (2023-09-29T20:48:44Z) - Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM.
TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image.
TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z) - ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding [86.47555696652618]
This paper presents recent progress on integrating speech separation and enhancement into the ESPnet toolkit.
A new interface has been designed to combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU).
Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR.
arXiv Detail & Related papers (2022-07-19T18:55:29Z) - CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z) - Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model, and dense sentence-level vector representations using an LSTM-based text autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z) - Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z) - Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation [25.57530524167637]
Visual dialogue needs to answer a series of coherent questions on the basis of understanding the visual environment.
Visual grounding aims to explicitly locate related objects in the image guided by textual entities.
The multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response.
arXiv Detail & Related papers (2021-09-17T11:39:29Z) - VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning [14.553086325168803]
We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints (obtained from the environment).
We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment.
The goal is to force an RL agent to use both text and visual features to predict natural language action commands for solving the final task of cooking a meal.
arXiv Detail & Related papers (2020-10-26T18:51:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.