How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
- URL: http://arxiv.org/abs/2511.09748v1
- Date: Fri, 14 Nov 2025 01:07:40 GMT
- Title: How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
- Authors: Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
- Abstract summary: We benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning.
- Score: 1.3288901827225499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
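The abstract names the key ingredients (a per-class logit bias, majority voting over prompt variants, and MCC/F1 reporting) but not their implementation. The Python sketch below only illustrates how such a pipeline could be wired together; the two-logit interface, the bias value, and the tie-breaking rule are illustrative assumptions, not the authors' released scripts.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def calibrated_label(logit_err, logit_not, bias_err=0.0):
    """Pick ERR (1) or NOT (0) from two class logits after adding a scalar
    bias to ERR. In practice the bias would be tuned on a small dev set so
    the predicted ERR rate matches the dev-set prior (simple calibration)."""
    return 1 if (logit_err + bias_err) > logit_not else 0

def majority_vote(votes):
    """Aggregate binary votes from several prompt variants; ties go to ERR,
    i.e. err on the side of flagging a critical error."""
    return 1 if sum(votes) * 2 >= len(votes) else 0

def evaluate(y_true, y_pred):
    """Report the same headline numbers as the paper: MCC, F1-ERR, F1-NOT."""
    return {
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F1-ERR": f1_score(y_true, y_pred, pos_label=1),
        "F1-NOT": f1_score(y_true, y_pred, pos_label=0),
    }

# Toy usage with made-up (ERR, NOT) logits from three prompt variants per sample.
per_prompt_logits = [
    [(2.1, 1.0), (1.9, 2.0), (2.5, 1.2)],  # sample 1: true critical error
    [(0.2, 1.5), (0.1, 1.8), (0.4, 1.1)],  # sample 2: no critical error
]
y_true = [1, 0]
y_pred = [majority_vote([calibrated_label(e, n, bias_err=0.3) for e, n in s])
          for s in per_prompt_logits]
print(evaluate(y_true, y_pred))
```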
Related papers
- Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports [0.43981305860983716]
Large Language Models (LLMs) have shown promise at diagnosing hardware issues from textual user reports. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies. Three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it.
arXiv Detail & Related papers (2025-07-01T13:46:00Z) - Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation [0.0]
Tiny QA Benchmark++ (TQB++) is designed to give large-language-model (LLM) pipelines a unit-test style safety-net dataset that runs in seconds with minimal cost. TQB++ couples a 52-item English gold set with a tiny synthetic-data generator PyPI package built on provider-agnostic LiteLLM. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools.
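As a rough illustration of the "seconds-scale safety net" idea above (the file layout, field names, and `answer_fn` hook are assumptions, not the TQB++ API), a CI smoke test over a tiny QA set could look like this:

```python
import json

def smoke_test(qa_path, answer_fn, min_accuracy=0.9):
    """Tiny exact-match QA check meant to run in a CI job in seconds.

    qa_path   : JSONL file whose lines look like {"question": ..., "answer": ...}
    answer_fn : callable mapping a question string to the model's answer string
    """
    with open(qa_path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]
    hits = sum(
        answer_fn(it["question"]).strip().lower() == it["answer"].strip().lower()
        for it in items
    )
    accuracy = hits / len(items)
    assert accuracy >= min_accuracy, f"LLM smoke test failed: accuracy={accuracy:.2f}"
    return accuracy
```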
arXiv Detail & Related papers (2025-05-17T15:40:03Z) - MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions. We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
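MICE's actual estimators are defined in that paper; a much simpler, generic baseline for the idea it hints at (confidence read off the model's own token probabilities) might look like the sketch below, where `token_logprobs` is assumed to hold the log-probabilities of the generated tool-call tokens.

```python
import math

def sequence_confidence(token_logprobs):
    """Crude model-internal confidence for a generated tool call: the
    geometric mean of token probabilities (length-normalized likelihood)."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def should_execute(token_logprobs, threshold=0.8):
    """Only run the tool call when confidence clears a risk threshold."""
    return sequence_confidence(token_logprobs) >= threshold
```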
arXiv Detail & Related papers (2025-04-28T18:06:38Z) - Grammatical Error Correction for Low-Resource Languages: The Case of Zarma [8.40484790921164]
Grammatical error correction (GEC) aims to improve the quality and readability of texts. We present a study on GEC for Zarma, spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models.
arXiv Detail & Related papers (2024-10-20T23:51:36Z) - Automated Multi-Language to English Machine Translation Using Generative Pre-Trained Transformers [0.8192907805418583]
This study examines using local Generative Pretrained Transformer (GPT) models to perform automated zero-shot, black-box, sentence-wise translation of multiple natural languages into English text.
We benchmark 16 different open-source GPT models, with no custom fine-tuning, from the Huggingface LLM repository for translating 50 different non-English languages into English.
The reported benchmark metrics are translation accuracy, measured with the BLEU, GLEU, METEOR, and chrF text-overlap measures, and wall-clock time per sentence translation.
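For reference, the overlap metrics named above can be computed with standard libraries; a minimal sketch using sacrebleu (BLEU, chrF) and NLTK (GLEU) on a toy hypothesis/reference pair follows (METEOR is also available in NLTK but needs WordNet data, so it is omitted here).

```python
import sacrebleu
from nltk.translate.gleu_score import sentence_gleu

hyps = ["The cat sits on the mat."]
refs = [["The cat sat on the mat."]]  # sacrebleu expects one list per reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)   # corpus-level BLEU
chrf = sacrebleu.corpus_chrf(hyps, refs)   # character n-gram F-score
gleu = sentence_gleu([refs[0][0].split()], hyps[0].split())  # NLTK sentence GLEU

print(f"BLEU={bleu.score:.1f}  chrF={chrf.score:.1f}  GLEU={gleu:.2f}")
```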
arXiv Detail & Related papers (2024-04-23T02:19:35Z) - How Far Can 100 Samples Go? Unlocking Overall Zero-Shot Multilingual Translation via Tiny Multi-Parallel Data [10.286714403840355]
A common, albeit resource-consuming, solution is to add as many related translation directions as possible to the training corpus.
We show that for an English-centric model, surprisingly large zero-shot improvements can be achieved by simply fine-tuning with a very small amount of multi-parallel data.
arXiv Detail & Related papers (2024-01-22T23:55:00Z) - Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [50.00235162432848]
We train ALMA models with only 22K parallel sentences and 12M parameters.
The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4.
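The summary does not spell out the training objective; schematically, contrastive preference optimization combines a DPO-style preference term without a frozen reference model and a likelihood term on the preferred translation. The sketch below paraphrases that idea and is not the authors' implementation; `logp_chosen`/`logp_rejected` are assumed to be sequence log-probabilities under the model being trained.

```python
import torch
import torch.nn.functional as F

def cpo_style_loss(logp_chosen, logp_rejected, beta=0.1):
    """Schematic preference loss: push the model to prefer the better
    translation (chosen) over the worse one (rejected), plus an NLL term
    that keeps probability mass on the chosen translation."""
    prefer = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    nll = -logp_chosen.mean()
    return prefer + nll

# Toy usage with made-up sequence log-probabilities for a batch of 3 pairs.
logp_chosen = torch.tensor([-10.0, -12.5, -9.3])
logp_rejected = torch.tensor([-14.0, -13.0, -15.2])
print(cpo_style_loss(logp_chosen, logp_rejected))
```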
arXiv Detail & Related papers (2024-01-16T15:04:51Z) - BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
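Measuring the "consistency of metrics" on minimal pairs reduces to checking how often a metric scores the faithful summary above its minimally edited, unfaithful counterpart; a small sketch of that computation follows (the `metric` callable and the pair fields are assumptions for illustration, not BUMP's released code).

```python
def pairwise_consistency(pairs, metric):
    """Fraction of minimal pairs where the metric ranks the faithful summary
    above the unfaithful one. Each pair is (source, faithful, unfaithful)."""
    wins = sum(
        metric(source, faithful) > metric(source, unfaithful)
        for source, faithful, unfaithful in pairs
    )
    return wins / len(pairs)
```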
arXiv Detail & Related papers (2022-12-20T02:17:30Z) - SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
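Uniform sampling over language pairs simply means each training batch draws its pair with equal probability rather than in proportion to corpus size, so low-resource pairs are not swamped; a tiny sketch of the two schedules is shown below (the corpus sizes are made up).

```python
import random

corpora = {"en-de": 4_000_000, "en-sw": 30_000, "en-ne": 15_000}

def sample_pair(uniform=True):
    """Pick the language pair for the next batch."""
    pairs = list(corpora)
    if uniform:
        return random.choice(pairs)                      # every pair equally likely
    sizes = list(corpora.values())
    return random.choices(pairs, weights=sizes, k=1)[0]  # size-proportional baseline
```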
arXiv Detail & Related papers (2022-10-20T22:32:29Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
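A few-shot instruction prompt of the kind described is just an instruction plus a handful of labeled demonstrations followed by the query; the template below is a hypothetical illustration (the labels and wording are invented, not the paper's prompts).

```python
def build_bias_prompt(demonstrations, query):
    """Assemble a few-shot instruction prompt for yes/no bias detection.

    demonstrations: list of (text, label) pairs, label in {"biased", "not biased"}.
    """
    lines = ["Decide whether each statement expresses a social bias."]
    for text, label in demonstrations:
        lines.append(f"Statement: {text}\nAnswer: {label}")
    lines.append(f"Statement: {query}\nAnswer:")
    return "\n\n".join(lines)
```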
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)
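As a reminder of the mechanism described above, here is a minimal LoRA-style linear layer in PyTorch: the frozen weight is left untouched and a trainable low-rank update B·A (scaled by alpha/r) is added to its output. This is a generic sketch, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update of rank r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```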
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.