Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
- URL: http://arxiv.org/abs/2508.20476v1
- Date: Thu, 28 Aug 2025 06:51:42 GMT
- Title: Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
- Authors: Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
- Abstract summary: We introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
- Score: 52.859261069569165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
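To make the idea of a unified, modality-agnostic architecture more concrete, the following PyTorch sketch shows one way heterogeneous inputs (sign-language video features, lip-movement features, audio features) could be projected into a shared embedding space and concatenated into a single prefix for an LLM. All module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the architecture actually used in the paper.

```python
import torch
import torch.nn as nn

class ModalityAgnosticFrontEnd(nn.Module):
    """Hypothetical sketch: map any subset of modality features
    (sign, lip, audio) into one shared space an LLM can consume."""

    def __init__(self, feat_dims: dict, llm_dim: int = 1024):
        super().__init__()
        # One lightweight projector per modality, all mapping to the LLM width.
        self.projectors = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, llm_dim), nn.GELU(),
                                nn.Linear(llm_dim, llm_dim))
            for name, dim in feat_dims.items()
        })
        # Learned tags that mark which modality each span of tokens came from.
        self.modality_tags = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, llm_dim)) for name in feat_dims
        })

    def forward(self, features: dict) -> torch.Tensor:
        # features: {modality: (batch, time, feat_dim)} for any subset of modalities.
        spans = [self.projectors[name](x) + self.modality_tags[name]
                 for name, x in features.items()]
        # Concatenate along time so any modality combination becomes one prefix.
        return torch.cat(spans, dim=1)

# Usage sketch (dimensions are placeholders):
frontend = ModalityAgnosticFrontEnd({"sign": 512, "lip": 256, "audio": 768})
batch = {"sign": torch.randn(2, 50, 512), "lip": torch.randn(2, 60, 256)}
prefix = frontend(batch)  # (2, 110, 1024), ready to prepend to text embeddings
```

The learned modality tags are one simple way to let a shared backbone distinguish input types; modality-specific adapters or prompts would serve the same purpose.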
Related papers
- ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction [88.41471266579333]
We propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones show significant improvements.
arXiv Detail & Related papers (2025-11-09T08:50:11Z) - MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z) - Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations [18.74784108693223]
Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. The extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. This study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs.
arXiv Detail & Related papers (2025-09-19T06:29:33Z) - Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition [34.08564665311891]
directional-SpeechLlama is a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition. Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio.
arXiv Detail & Related papers (2025-06-17T20:49:41Z) - Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction [87.49303116989708]
We explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility.
arXiv Detail & Related papers (2025-06-11T14:36:26Z) - Linguistic Knowledge Transfer Learning for Speech Enhancement [29.191204225828354]
Linguistic knowledge plays a crucial role in spoken language comprehension. Most speech enhancement methods rely on acoustic features to learn the mapping relationship between noisy and clean speech. We propose the Cross-Modality Knowledge Transfer (CMKT) learning framework to integrate linguistic knowledge into SE models.
arXiv Detail & Related papers (2025-03-10T09:00:18Z) - Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement [36.136070412464214]
Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. We propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information.
arXiv Detail & Related papers (2025-01-23T04:36:29Z) - VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens; a minimal sketch of this style of objective is given after this list.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
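As a companion to the VATLM entry above, the sketch below shows what a masked-prediction objective over a unified token vocabulary can look like in PyTorch. The backbone, masking scheme, vocabulary size, and feature shapes are assumptions for illustration only; VATLM's actual recipe (its tokenizers and Transformer backbone) differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_unified_token_loss(backbone: nn.Module,
                              frames: torch.Tensor,
                              unified_tokens: torch.Tensor,
                              mask_prob: float = 0.15) -> torch.Tensor:
    """Mask a fraction of the input frames and train the shared backbone to
    predict the unified (modality-agnostic) token id at each masked position."""
    B, T, _ = frames.shape
    mask = torch.rand(B, T, device=frames.device) < mask_prob  # positions to hide
    masked = frames.clone()
    masked[mask] = 0.0                                          # simple zero-out masking
    logits = backbone(masked)                                   # (B, T, vocab_size)
    return F.cross_entropy(logits[mask], unified_tokens[mask])

# Toy usage (the MLP backbone and sizes below are placeholders):
backbone = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1000))
frames = torch.randn(4, 120, 80)             # e.g. audio or visual feature frames
tokens = torch.randint(0, 1000, (4, 120))    # unified discrete targets per frame
loss = masked_unified_token_loss(backbone, frames, tokens)
```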
This list is automatically generated from the titles and abstracts of the papers on this site.