Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
- URL: http://arxiv.org/abs/2504.13945v3
- Date: Wed, 23 Apr 2025 09:16:55 GMT
- Title: Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
- Authors: Zhanglin Wu, Tengfei Song, Ning Xie, Mengli Zhu, Weidong Zhang, Shuang Wu, Pengfei Li, Chong Li, Junhao Zhu, Hao Yang, Shiliang Sun
- Abstract summary: We propose a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish on a menu, along with its price and unit items. Our benchmark comprises a collection of Chinese and English menus characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of short-text responses and long-text responses with simple layouts, while the ability to understand long texts with complex layout designs, though highly significant, remains largely overlooked. In this paper, we propose the Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish on a menu, along with its price and unit items, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, together with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.
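To make the dish-level evaluation protocol concrete, here is a minimal, hypothetical sketch of how MOTBench-style scoring could work: model outputs are parsed into (dish, price, unit) triples and matched against human annotations. The data layout, field names, and matching rule are illustrative assumptions, not the benchmark's actual format or official metric.

```python
# Hypothetical sketch of MOTBench-style dish-level scoring.
# The DishEntry layout and the exact-match rule are assumptions for
# illustration, not the benchmark's actual schema or metric.
from dataclasses import dataclass

@dataclass(frozen=True)
class DishEntry:
    dish: str   # translated dish name
    price: str  # price as printed on the menu, e.g. "38"
    unit: str   # unit item, e.g. "CNY/bowl"

def score_menu(predicted: list[DishEntry], gold: list[DishEntry]) -> dict[str, float]:
    """Fraction of gold entries fully matched by a prediction, plus a
    looser accuracy that only requires price and unit to agree."""
    remaining = list(predicted)
    full, price_unit = 0, 0
    for g in gold:
        for p in remaining:
            if p.price == g.price and p.unit == g.unit:
                price_unit += 1
                if p.dish.strip().lower() == g.dish.strip().lower():
                    full += 1
                remaining.remove(p)  # each prediction matches at most once
                break
    n = max(len(gold), 1)
    return {"full_match": full / n, "price_unit_match": price_unit / n}

# Example usage with a one-dish menu:
gold = [DishEntry("Braised beef noodles", "38", "CNY/bowl")]
pred = [DishEntry("Braised Beef Noodles", "38", "CNY/bowl")]
print(score_menu(pred, gold))  # {'full_match': 1.0, 'price_unit_match': 1.0}
```

A real scorer would also need the fuzzy or semantic matching of dish names that human annotators apply implicitly; exact string comparison is used here only to keep the sketch short.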
Related papers
- M-Prometheus: A Suite of Open Multilingual LLM Judges [64.22940792713713]
We introduce M-Prometheus, a suite of open-weight LLM judges that can provide both direct assessment and pairwise comparison feedback on multilingual outputs.
M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs.
arXiv Detail & Related papers (2025-04-07T11:37:26Z) - A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning [27.350370419751385]
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. This paper introduces and analyzes BRSIC, a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions.
arXiv Detail & Related papers (2025-03-06T16:31:34Z) - ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce ProverbEval, an LLM evaluation benchmark for low-resource languages.
Native language proverb descriptions significantly improve tasks such as proverb generation.
Monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks.
arXiv Detail & Related papers (2024-11-07T06:34:48Z) - Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods mainly focus on aligning vision encoders with Multimodal Large Language Models (MLLMs).
We introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level.
Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.
arXiv Detail & Related papers (2024-06-04T17:56:28Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Machine Translation with Large Language Models: Prompt Engineering for Persian, English, and Russian Directions [0.0]
Generative large language models (LLMs) have demonstrated exceptional proficiency in various natural language processing (NLP) tasks.
We conducted an investigation into two popular prompting methods and their combination, focusing on cross-language combinations of Persian, English, and Russian.
arXiv Detail & Related papers (2024-01-16T15:16:34Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks (a sketch of the translate-test setup follows this list).
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
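To illustrate the translate-test transfer strategy mentioned in the IGLUE entry above, here is a minimal, hypothetical sketch: the non-English input is first machine-translated into English, and an English-only model then answers. The checkpoints and pipeline calls below are illustrative choices from the Hugging Face transformers library, not IGLUE's actual evaluation code.

```python
# Minimal sketch of translate-test transfer for a multilingual VQA task.
# Model checkpoints are example choices; any MT model paired with an
# English-only VQA model follows the same pattern.
from transformers import pipeline

# 1) Translate the non-English (here, German) question into English.
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# 2) Answer with an English-only visual question answering model.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def translate_test_vqa(image_path: str, question_de: str) -> str:
    question_en = mt(question_de)[0]["translation_text"]
    answers = vqa(image=image_path, question=question_en)
    return answers[0]["answer"]  # highest-scoring answer

# "kitchen.jpg" is a placeholder image path.
print(translate_test_vqa("kitchen.jpg", "Welche Farbe hat der Tisch?"))
```

The contrast with zero-shot transfer is that the English-only model never sees the original-language input; translation quality therefore bounds end-task accuracy.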
This list is automatically generated from the titles and abstracts of the papers in this site.