Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
- URL: http://arxiv.org/abs/2411.16789v2
- Date: Mon, 25 Aug 2025 12:12:26 GMT
- Title: Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
- Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim
- Abstract summary: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. We propose a novel gloss-free framework called Multimodal Sign Language Translation (MMSLT). Our approach achieves state-of-the-art performance on the benchmark datasets PHOENIX14T and CSL-Daily.
- Score: 14.817951264354022
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT. Code is available at https://github.com/hwjeon98/MMSLT.
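The abstract describes a two-step pipeline: an off-the-shelf MLLM produces textual descriptions of sign components, and a pre-training module fuses those description features with video features to align them in the spoken-sentence space. A minimal numpy sketch of that flow is below; all function names, dimensions, and the mean-pooled fusion are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mllm_describe(frame_feat):
    """Stand-in for an off-the-shelf MLLM that returns an embedding of a
    textual description of one sign-language frame (hypothetical)."""
    return rng.standard_normal(16)

def fuse(video_feat, desc_feat):
    """Toy multimodal fusion: concatenate visual and description features,
    then project into a shared spoken-sentence embedding space."""
    x = np.concatenate([video_feat, desc_feat])  # (32,)
    W = np.ones((32, 8)) / 32                    # stand-in learned projection
    return x @ W                                 # (8,)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# One video of 4 frames: per-frame visual features plus MLLM descriptions.
frames = [rng.standard_normal(16) for _ in range(4)]
descs = [mllm_describe(f) for f in frames]
fused = np.mean([fuse(v, d) for v, d in zip(frames, descs)], axis=0)

# Alignment objective: pull the fused representation toward the target
# spoken-sentence embedding (here a random stand-in).
sentence_emb = rng.standard_normal(8)
alignment = cosine(fused, sentence_emb)
```

In the actual framework this alignment would be trained end-to-end; the sketch only shows where the description features enter relative to the video features.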
Related papers
- Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
arXiv Detail & Related papers (2025-12-08T21:05:46Z) - Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts [0.6144680854063939]
We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving rare scripts.
We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles.
Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment.
arXiv Detail & Related papers (2025-01-29T17:24:19Z) - LLaVA-SLT: Visual Language Tuning for Sign Language Translation [42.20090162339927]
Recent advancements in Sign Language Translation (SLT) have shown promise, yet they often lag well behind gloss-based approaches in terms of accuracy.
We introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings.
Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-12-21T08:01:08Z) - Enhanced Sign Language Translation between American Sign Language (ASL) and Indian Sign Language (ISL) Using LLMs [0.2678472239880052]
We present research that aims to bridge users of American Sign Language (ASL) with users of spoken language and Indian Sign Language (ISL).
This framework is tasked with key challenges such as automatically dealing with gesture variability and overcoming the linguistic differences between ASL and ISL.
arXiv Detail & Related papers (2024-11-19T17:45:12Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder.
To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework.
Our method achieves substantial performance gains on various downstream tasks.
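LLM2CLIP's caption-to-caption contrastive idea can be sketched with a standard InfoNCE-style loss in which two caption variants of the same image form a positive pair and other captions in the batch act as negatives. The deterministic toy below uses one-hot "embeddings" so the matched/mismatched contrast is exact; it is an illustration of the loss shape, not the paper's training setup.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Caption-to-caption contrastive loss: captions of the same image are
    positives; all other captions in the batch are negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # diagonal = matching pairs

caps_a = np.eye(4, 8)                      # 4 captions, 8-dim toy embeddings
caps_b = np.eye(4, 8)                      # identical paraphrase embeddings
loss_matched = info_nce(caps_a, caps_b)

caps_c = np.roll(np.eye(4, 8), 1, axis=0)  # deliberately mispaired captions
loss_mismatched = info_nce(caps_a, caps_c)
```

Correctly paired captions yield a much lower loss than mispaired ones, which is the gradient signal that shapes the LLM's output space for CLIP training.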
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z) - An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs).
We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.
Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training [29.47243668154796]
BLOOMZMMS is a novel model that integrates a multilingual LLM with a multilingual speech encoder.
We demonstrate the transferability of linguistic knowledge from the text to the speech modality.
Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks.
arXiv Detail & Related papers (2024-04-16T21:45:59Z) - LLMs are Good Sign Language Translators [19.259163728870696]
Sign Language Translation is a challenging task that aims to translate sign videos into spoken language.
We propose a novel SignLLM framework to transform sign videos into a language-like representation.
We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
arXiv Detail & Related papers (2024-04-01T05:07:13Z) - Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
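The LAPE idea can be sketched directly from its name: estimate how likely each neuron is to activate on each language, normalize those probabilities across languages, and compute the entropy per neuron; a low entropy means the neuron fires for only a few languages. The toy below assumes precomputed activation probabilities and is only a sketch of the scoring step, not the paper's full detection procedure.

```python
import numpy as np

def lape(act_probs, eps=1e-12):
    """Language activation probability entropy (LAPE), sketched:
    act_probs[l, j] = probability that neuron j activates on language l.
    Low entropy across languages => language-specific neuron."""
    p = act_probs / (act_probs.sum(axis=0, keepdims=True) + eps)  # normalize over languages
    return -(p * np.log(p + eps)).sum(axis=0)                     # entropy per neuron

# Toy profile over 3 languages and 2 neurons:
# neuron 0 fires only for language 0; neuron 1 fires equally for all.
probs = np.array([[0.9, 0.3],
                  [0.0, 0.3],
                  [0.0, 0.3]])
ent = lape(probs)
```

Selecting the neurons with the lowest LAPE scores (subject to a minimum activation probability) would then identify the language-specific subset the abstract refers to.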
arXiv Detail & Related papers (2024-02-26T09:36:05Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
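CIPHER's core move, per the abstract, is to communicate the probability-weighted average of token embeddings rather than a single sampled token, so the message preserves the whole output distribution. The sketch below illustrates that one step with a toy embedding table; the table, vocabulary size, and logits are all assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Token embedding table shared by the debating models (toy: 5 tokens, 4 dims).
rng = np.random.default_rng(0)
embed = rng.standard_normal((5, 4))

def cipher_message(logits):
    """Instead of sampling one token, transmit the probability-weighted
    average of token embeddings, keeping the full output distribution."""
    return softmax(logits) @ embed

logits = np.array([2.0, 0.5, 0.1, -1.0, -2.0])
msg = cipher_message(logits)

# A sampled-token message would keep only the argmax embedding; CIPHER's
# message retains mass from every candidate token.
hard = embed[int(np.argmax(logits))]
```

Because no token is ever decoded, this works without any modification to the model weights, as the abstract notes.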
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining [56.26550923909137]
Gloss-Free Sign Language Translation (SLT) is a challenging task due to its cross-domain nature.
We propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-)
Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual and Text Decoder from
arXiv Detail & Related papers (2023-07-27T10:59:18Z) - DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries [8.83363871195679]
Existing multilingual pre-training relies on the masked language modeling (MLM) objective as its key language learning objective.
DICT-MLM works by incentivizing the model to be able to predict not just the original masked word, but potentially any of its cross-lingual synonyms as well.
Our empirical analysis on multiple downstream tasks spanning 30+ languages, demonstrates the efficacy of the proposed approach.
arXiv Detail & Related papers (2020-10-23T17:53:11Z)
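The DICT-MLM objective above can be sketched as a masked-prediction loss that credits probability mass on any dictionary-listed cross-lingual synonym of the masked word, not just the original token. The toy below is one plausible formalization of that idea (negative log of the total mass on valid targets); the vocabulary and probabilities are illustrative.

```python
import numpy as np

def dict_mlm_loss(probs, target_ids):
    """DICT-MLM-style objective, sketched: reward probability placed on ANY
    cross-lingual synonym of the masked word, not only the original token."""
    mass = probs[target_ids].sum()    # total mass on valid targets
    return -np.log(mass + 1e-12)

# Toy vocab of 5 tokens; ids 1 and 3 stand for "cat" and its translation.
probs = np.array([0.1, 0.35, 0.1, 0.35, 0.1])
loss_multi = dict_mlm_loss(probs, [1, 3])  # credit either synonym
loss_single = dict_mlm_loss(probs, [1])    # vanilla MLM: original word only
```

A model that spreads its prediction across translations is penalized by vanilla MLM but not by the dictionary-aware loss, which is the incentive the abstract describes.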
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.