Related papers: On Non-interactive Evaluation of Animal Communication Translators

URL: http://arxiv.org/abs/2510.15768v1
Date: Fri, 17 Oct 2025 15:56:30 GMT
Title: On Non-interactive Evaluation of Animal Communication Translators
Authors: Orr Paradise, David F. Gruber, Adam Tauman Kalai
Abstract summary: This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted.
Score: 8.958679534486855
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may not be necessary nor efficient in the early stages of learning to translate.
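
The shuffle test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `shuffle_test_score`, the `coherence` callback, the toy scorer, and the convention of counting ties as half a win are all assumptions made for illustration. In practice the coherence scorer would presumably be something like a language model's plausibility judgment of the translated segments; here a deliberately trivial stand-in is used so the example runs on its own.

```python
import random
from typing import Callable, List, Sequence


def shuffle_test_score(
    segments: Sequence[str],
    coherence: Callable[[Sequence[str]], float],
    num_permutations: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of sampled permutations that score lower than the original order.

    Ties count as half a win (an assumption of this sketch). Returns 0.5 when
    no non-identity permutation could be sampled.
    """
    rng = random.Random(seed)
    original_order: List[str] = list(segments)
    original_score = coherence(original_order)

    wins = 0.0
    trials = 0
    for _ in range(num_permutations):
        permuted = list(original_order)
        rng.shuffle(permuted)
        if permuted == original_order:
            continue  # the identity permutation carries no information
        trials += 1
        permuted_score = coherence(permuted)
        if original_score > permuted_score:
            wins += 1.0
        elif original_score == permuted_score:
            wins += 0.5
    return wins / trials if trials else 0.5


def toy_coherence(segments: Sequence[str]) -> float:
    """Hypothetical stand-in scorer: counts adjacent pairs whose leading
    integer increases. A real scorer would judge the English text itself."""
    numbers = [int(s.split()[0]) for s in segments]
    return float(sum(1 for a, b in zip(numbers, numbers[1:]) if b > a))


if __name__ == "__main__":
    translated_turns = ["1 hello", "2 how are you", "3 fine thanks", "4 goodbye"]
    print(shuffle_test_score(translated_turns, toy_coherence))
```

A score near 1 means the translated turns are consistently judged more sensible in their original order, while a score near 0.5 means order makes no difference to the scorer, which is what one would expect from a translator whose outputs are unrelated to the source.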

Related papers

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
Cross-lingual neural fuzzy matching for exploiting target-language monolingual corpora in computer-aided translation [0.0]
In this paper, we introduce a novel neural approach aimed at exploiting in-domain target-language (TL) monolingual corpora. Our approach relies on cross-lingual sentence embeddings to retrieve translation proposals from TL monolingual corpora, and on a neural model to estimate their post-editing effort. The paper presents an automatic evaluation of these techniques on four language pairs that shows that our approach can successfully exploit monolingual texts in a TM-based CAT environment.
arXiv Detail & Related papers (2024-01-16T14:00:28Z)
Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
A Theory of Unsupervised Translation Motivated by Understanding Animal Communication [7.748040467625809]
We propose a theoretical framework for analyzing Unsupervised Machine Translation. We show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.
arXiv Detail & Related papers (2022-11-20T20:55:38Z)
Rethinking Round-Trip Translation for Machine Translation Evaluation [44.83568796515321]
We report the surprising finding that round-trip translation can be used for automatic evaluation without references. We demonstrate that the rectification is overdue, as round-trip translation could benefit multiple machine translation evaluation tasks.
arXiv Detail & Related papers (2022-09-15T15:06:20Z)
An Interpretability Evaluation Benchmark for Pre-trained Language Models [37.16893581395874]
We propose a novel evaluation benchmark that provides annotated data in both English and Chinese. It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning and computation. It contains perturbed instances for each original instance, so that rationale consistency under perturbations can be used as the metric for faithfulness.
arXiv Detail & Related papers (2022-07-28T08:28:09Z)
A Bayesian approach to translators' reliability assessment [0.0]
We treat the Translation Quality Assessment (TQA) process as a complex process, viewing it from the standpoint of the physics of complex systems. We build two Bayesian models that parameterise the features involved in the TQA process, namely the translation difficulty and the characteristics of the translators involved in producing the translation and assessing its quality. We show that reviewers' reliability cannot be taken for granted even if they are expert translators.
arXiv Detail & Related papers (2022-03-14T14:29:45Z)
ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language. It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models. In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them. We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation. We ask whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.