Fugu-MT Paper Translation (Abstract): "Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models

Paper overview: "Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.31401v2
Date: Mon, 01 Jun 2026 04:58:07 GMT
Status: Translation complete
System update date: 2026-06-02 18:24:16.922803
Title: "Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models
Title (reference translation): "A Recipe for Romanian Vision-Language Models"
Authors: Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea
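Abstract summary: We describe a systematic study of building a language-specific Vision-Language Model (VLM) for Romanian. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text. Using this data, we train and ablate a series of VLMs to isolate the contribution of vision backbones of varying scale and pretraining. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes.
Reference score (independently computed attention measure): 7.120569645707792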
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform their same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.
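Abstract (reference translation): Vision-Language Models (VLMs) largely follow the trajectory of text-only LLMs: they excel on English benchmarks but degrade markedly on low-resource languages, for which neither large-scale image-text corpora nor culturally grounded evaluations exist. We conduct a systematic study of building a language-specific VLM for Romanian, covering the entire pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, which preserves visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones ranging from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data. We further curate HoraVQA, a culturally native evaluation set grounded in Romanian everyday scenes. Romanian-adapted VLMs consistently outperform same-sized counterparts and, across all evaluated benchmarks, even surpass models from the next larger size category.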

Related papers