Fugu-MT Paper Translation (Abstract): AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

Paper overview: AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

arxiv url: http://arxiv.org/abs/2603.09689v1
Date: Tue, 10 Mar 2026 13:57:52 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.354404
Title: AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
Title (reference translation): AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
Authors: Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le
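Abstract summary: Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks.
Reference score (independently computed attention level): 2.4577252294937444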
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains -- such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning -- multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
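Abstract (reference translation): Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems depended heavily on language biases, which motivated later work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision, such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning, multimodal fusion has made remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research on low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources make it possible to benchmark models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often uses automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. This work explores Vietnamese Visual Question Answering with transformer-based architectures that leverage both textual and visual pre-training, while systematically comparing automatic evaluation metrics under multilingual settings.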
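
The abstract names Recall, Precision, and F1-score among the automatic metrics used to evaluate VQA systems, but does not say how they are computed. As a minimal sketch only, assuming lowercased whitespace tokenization and a single reference answer per question (both are assumptions, not details from the paper), token-level scores could be computed as follows. The function name `token_prf` is hypothetical.

```python
from collections import Counter
from typing import Tuple


def token_prf(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Return token-level (precision, recall, F1) for one predicted answer.

    Assumption: answers are compared as bags of lowercased,
    whitespace-separated tokens; the paper may tokenize differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Edge case: if either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        score = float(pred_tokens == ref_tokens)
        return score, score, score

    # Multiset intersection counts each shared token at most as often
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative Vietnamese answer pair (made up for this sketch).
    print(token_prf("hai con mèo", "hai con mèo đen"))
```

In the illustrative pair, every predicted token appears in the reference (precision 1.0), while the reference has one extra token (recall 0.75). Whitespace splitting treats each Vietnamese syllable as a token; a word segmenter would give different counts.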
