LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
- URL: http://arxiv.org/abs/2509.10026v3
- Date: Fri, 10 Oct 2025 08:28:38 GMT
- Title: LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
- Authors: Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Changtao Miao, Huazhe Tan, Weibin Yao, Jianshu Li
- Abstract summary: Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. LaV-CoT is the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT achieves up to 9.5% accuracy improvements over open-source baselines.
- Score: 39.131225916852834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conduct an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
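The most concrete algorithmic detail in the abstract is the reward design: GRPO samples a group of responses per prompt and standardizes each response's reward within that group, with the reward composed from verifiable language-consistency, structural, and semantic checks. The following is a minimal Python sketch under stated assumptions; the scoring callables, the equal default weights, and the helper names are hypothetical illustrations, not LaV-CoT's released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class RewardWeights:
    """Relative weights for the three verifiable reward aspects (assumed equal here)."""
    language: float = 1.0    # answer stays in the question's language
    structure: float = 1.0   # output follows the staged CoT format
    semantics: float = 1.0   # answer agrees with the reference

def multi_aspect_reward(response: str, reference: str, target_lang: str,
                        score_lang: Callable[[str, str], float],
                        score_struct: Callable[[str], float],
                        score_sem: Callable[[str, str], float],
                        w: Optional[RewardWeights] = None) -> float:
    """Fold per-aspect scores (each assumed in [0, 1]) into one scalar reward."""
    w = w or RewardWeights()
    total = (w.language * score_lang(response, target_lang)
             + w.structure * score_struct(response)
             + w.semantics * score_sem(response, reference))
    return total / (w.language + w.structure + w.semantics)

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: standardize each sampled response's reward
    against the mean and std of its own sampled group (no learned critic)."""
    mean = sum(rewards) / len(rewards)
    std = max((sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-8)
    return [(r - mean) / std for r in rewards]
```

In GRPO the standardized group score stands in for a learned value critic: each sampled response's advantage directly scales its token log-probability updates in the policy step, which is what makes verifiable, rule-checkable rewards like the three above attractive.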
Related papers
- Optimizing Language Models for Crosslingual Knowledge Consistency [90.86445137816942]
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function.
arXiv Detail & Related papers (2026-03-04T23:36:55Z)
- Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment [6.718469075779034]
We show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus can substantially improve representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages.
arXiv Detail & Related papers (2026-02-25T03:58:24Z)
- TowerVision: Understanding and Improving Multilinguality in Vision-Language Models [56.775118098058506]
TowerVision is a family of open multilingual vision-language models for both image-text and video-text tasks. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches. To support further research, we publicly release all models, data, and training recipes.
arXiv Detail & Related papers (2025-10-22T17:02:48Z)
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters [53.59868121093848]
We introduce Seed-X, a family of open-source language models (LLMs) with 7B parameters. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages. The instruct model is then fine-tuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs.
arXiv Detail & Related papers (2025-07-18T03:19:43Z)
- Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation [45.551223552275424]
Vision-Language Translation is a challenging task that requires accurately recognizing multilingual text embedded in images. We present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics.
arXiv Detail & Related papers (2025-06-13T14:23:38Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs (a generic sketch of such a loss follows this list).
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
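As referenced in the RC3 entry above: RC3's published objective is defined in its paper, so the following is only a generic, hedged sketch of a contrastive loss with a proximity-style regularizer for weakly-aligned image-text pairs. The `weak_mask` convention, the MSE proximity term, and the weight `lam` are illustrative assumptions, not RC3's formulation.

```python
import torch
import torch.nn.functional as F

def regularized_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 weak_mask: torch.Tensor,
                                 temperature: float = 0.07,
                                 lam: float = 0.1) -> torch.Tensor:
    """InfoNCE over strongly-aligned pairs plus a regularizer that only
    constrains (rather than fully pulls together) the embeddings of
    weakly-aligned image-text pairs.

    img_emb, txt_emb: (batch, dim) embeddings; pair i is (img_emb[i], txt_emb[i]).
    weak_mask: (batch,) bool, True where the pair is only weakly aligned.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)

    # Symmetric InfoNCE on the strongly-aligned pairs: each row's / column's
    # positive is its own diagonal entry.
    strong = ~weak_mask
    nce = (F.cross_entropy(logits[strong], targets[strong])
           + F.cross_entropy(logits.t()[strong], targets[strong])) / 2

    # Proximity regularizer: keep weakly-aligned pairs close in representation
    # space without treating them as exact positives.
    prox = (F.mse_loss(img[weak_mask], txt[weak_mask])
            if weak_mask.any() else logits.new_zeros(()))
    return nce + lam * prox
```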