Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
- URL: http://arxiv.org/abs/2411.16789v1
- Date: Mon, 25 Nov 2024 09:01:41 GMT
- Title: Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
- Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim
- Abstract summary: We propose a novel gloss-free Multimodal Sign Language Translation framework.
We generate detailed textual descriptions of sign language components using multimodal large language models.
Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily.
- Score: 6.688680877428467
- License:
- Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.
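To make the described pipeline concrete, here is a minimal sketch of the idea: an off-the-shelf MLLM produces a textual description per sign clip, and a fusion module combines description features with sign video features so the pooled representation can be aligned with spoken-sentence embeddings. The module names, dimensions, and the simple cosine alignment objective are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DescriptionVideoFusion(nn.Module):
    """Fuses MLLM-generated description features with sign video features.
    (Illustrative sketch only; not the paper's architecture.)"""

    def __init__(self, video_dim=1024, text_dim=768, hidden_dim=768):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, video_feats, desc_feats):
        # video_feats: (B, T_v, video_dim) frame-level sign features
        # desc_feats:  (B, T_d, text_dim) token features of the MLLM description
        fused = torch.cat(
            [self.video_proj(video_feats), self.text_proj(desc_feats)], dim=1
        )
        return self.fusion(fused)  # (B, T_v + T_d, hidden_dim)

def alignment_loss(fused_repr, sentence_emb):
    # Pool the multimodal features and pull them toward the spoken-sentence
    # embedding (a simple cosine stand-in for the pre-training objective).
    pooled = fused_repr.mean(dim=1)
    return 1.0 - nn.functional.cosine_similarity(pooled, sentence_emb).mean()
```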
Related papers
- Reasoning Over the Glyphs: Evaluation of LLM's Decipherment of Rare Scripts [0.6144680854063939]
We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving rare scripts.
We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles.
Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment.
arXiv Detail & Related papers (2025-01-29T17:24:19Z) - LLaVA-SLT: Visual Language Tuning for Sign Language Translation [42.20090162339927]
Recent advancements in Sign Language Translation (SLT) have shown promise, yet gloss-free approaches still lag significantly behind gloss-based ones in terms of accuracy.
We introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings.
Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-12-21T08:01:08Z) - LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder.
To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework.
Our method achieves substantial performance gains on various downstream tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z) - SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z) - An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs).
We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.
Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - LLMs are Good Sign Language Translators [19.259163728870696]
Sign Language Translation is a challenging task that aims to translate sign videos into spoken language.
We propose a novel SignLLM framework to transform sign videos into a language-like representation.
We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
arXiv Detail & Related papers (2024-04-01T05:07:13Z) - Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons (a minimal sketch of the LAPE criterion appears after this list).
arXiv Detail & Related papers (2024-02-26T09:36:05Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address the information loss incurred when LLMs communicate through sampled natural-language tokens.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.