Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
- URL: http://arxiv.org/abs/2602.00393v1
- Date: Fri, 30 Jan 2026 23:12:03 GMT
- Title: Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
- Authors: Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira
- Abstract summary: This study proposes a cross-native evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K composed of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets.
- Score: 41.02122939888977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K composed of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available at: https://github.com/laicsiifes/transformer-caption-ptbr.
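For context, the CLIP-Score metric mentioned above is a reference-free measure: it scores a caption by the cosine similarity between the CLIP image and text embeddings, scaled and clipped at zero. Below is a minimal illustrative sketch, not the authors' exact evaluation code; the checkpoint name, image path, and caption are placeholder assumptions, and scoring Portuguese captions directly would in practice call for a multilingual CLIP variant.

```python
# Minimal CLIPScore sketch: score = w * max(cos(image_embed, text_embed), 0), with w = 2.5.
# Assumptions: the checkpoint, image path, and caption below are placeholders;
# a multilingual CLIP text encoder would be needed for Portuguese captions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # placeholder checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_score(image_path: str, caption: str, w: float = 2.5) -> float:
    """Return w * max(cosine similarity, 0) for a single image-caption pair."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return w * max((img * txt).sum(dim=-1).item(), 0.0)

# Hypothetical usage:
# print(clip_score("example.jpg", "Um cachorro correndo na praia."))
```

Averaging such a score over a test split would allow the native and translated Flickr30K variants to be compared on image-text alignment, complementing the text-overlap metrics used in the cross-context experiments.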
Related papers
- Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation [0.0]
This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. We identify future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis.
arXiv Detail & Related papers (2025-06-03T22:18:19Z)
- A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning [27.350370419751385]
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. This paper introduces and analyzes BRSIC, a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions.
arXiv Detail & Related papers (2025-03-06T16:31:34Z)
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are now extensively used to support vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation [0.0]
This study explores the transfer learning capabilities of the TrOCR architecture to Spanish.
We integrate an English TrOCR encoder with a language-specific decoder and train the model on that language.
Fine-tuning the English TrOCR on Spanish yields better recognition than the language-specific decoder for a fixed dataset size.
arXiv Detail & Related papers (2024-07-09T15:31:41Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
Building on CLIP, we propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low-, medium-, and rich-resource scenarios, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage [23.71195344840051]
Cross-modal language generation tasks such as image captioning are directly hurt by the trend of data-hungry models combined with the lack of non-English annotations.
We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which directly leverages both existing English annotations and their machine-translated versions at training time.
We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages.
arXiv Detail & Related papers (2020-05-01T06:58:18Z)