LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
- URL: http://arxiv.org/abs/2603.00592v1
- Date: Sat, 28 Feb 2026 10:53:33 GMT
- Title: LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
- Authors: Yuchen Hou, Lin Zhao
- Abstract summary: Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. We find that current state-of-the-art VLA models largely ignore language instructions. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method.
- Score: 4.54067274409672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap -- the success rate improves from 0% to 90% with single-task training, and from 0% to 28% with multi-task training. However, as the semantic diversity of the extended tasks increases, the models' learning capacity proves severely insufficient, and performance remains poor even on tasks included in training. This reveals a fundamental challenge for VLA models in understanding diverse language instructions -- precisely the long-term value of LangGap.
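The layout-fixed task diversification idea is simple to illustrate. The sketch below is not the authors' released code; the object names, target names, and the `enumerate_tasks` helper are illustrative assumptions showing how one fixed tabletop layout can be paired with every pick-and-place instruction it supports, rather than with a single task as in LIBERO-style benchmarks.

```python
# Illustrative sketch (not the LangGap release): enumerate all pick-and-place
# instructions for a fixed tabletop layout, so the same scene is paired with
# many semantically distinct commands.
from itertools import product

# Hypothetical fixed layout: every object and target location stays put.
objects = ["red mug", "blue bowl", "green block"]
targets = ["left tray", "right tray", "top shelf"]

def enumerate_tasks(objects, targets):
    """Yield one natural-language instruction per (object, target) pair.

    A one-task-per-layout benchmark would keep only one of these;
    diversifying them forces the policy to read the instruction rather
    than memorize the scene.
    """
    for obj, tgt in product(objects, targets):
        yield f"pick up the {obj} and place it on the {tgt}"

if __name__ == "__main__":
    for instruction in enumerate_tasks(objects, targets):
        print(instruction)
```

Enumerations of this kind can also serve as training data, which matches the spirit of the targeted data augmentation the abstract reports.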
Related papers
- Language Steering for Multilingual In-Context Learning [10.932074928744568]
Large language models' performance on non-English languages remains substantially inferior to their performance on English. We propose language vectors -- a training-free language steering approach. We show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested.
arXiv Detail & Related papers (2026-02-02T16:52:09Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, comprising 6.5-8% of all parameters in 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z) - Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation [14.309869321407522]
We introduce multilingual pre-trained language-meaning models based on Discourse Representation Structures (DRSs).
Since DRSs are language neutral, cross-lingual transfer learning is adopted to further improve the performance of non-English tasks.
Automatic evaluation results show that our approach achieves the best performance on both the multilingual DRS parsing and DRS-to-text generation tasks.
arXiv Detail & Related papers (2023-05-31T19:00:33Z) - Por Qué Não Utiliser Alla Språk? Mixed Training with Gradient Optimization in Few-Shot Cross-Lingual Transfer [2.7213511121305465]
We propose a one-step mixed training method that trains on both source and target data.
We use one model to handle all target languages simultaneously to avoid excessively language-specific models.
Our proposed method achieves state-of-the-art performance on all tasks and outperforms target-adapting by a large margin.
arXiv Detail & Related papers (2022-04-29T04:05:02Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)