Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations
- URL: http://arxiv.org/abs/2504.14619v1
- Date: Sun, 20 Apr 2025 13:54:28 GMT
- Authors: Yuri Balashov, Alex Balashov, Shiho Fukuda Koski
- Abstract summary: This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies. We aim to empower translators with actionable methods to harness these advancements.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics -- such as BLEU, chrF, TER, and COMET -- to freelancers' needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.
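The abstract proposes adapting string-based metrics such as chrF to freelancers' own reference translations. As a concrete illustration, below is a minimal, self-contained sketch of a chrF-style character n-gram F-score in plain Python. This is a simplified version for illustration only, not the paper's framework or the reference sacrebleu implementation; the function names and the space-stripping choice are our own assumptions.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of a string (spaces removed, a common chrF simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average char n-gram precision and recall over
    orders 1..max_n, combined as an F-beta score (beta=2 weights recall)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()) > 0:
            precisions.append(overlap / sum(hyp.values()))
        if sum(ref.values()) > 0:
            recalls.append(overlap / sum(ref.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

In practice a freelancer would use an established library (e.g. sacrebleu for BLEU, chrF, and TER) rather than a hand-rolled score, but the sketch shows why chrF is robust for morphologically rich languages: it rewards partial, sub-word overlap that word-level BLEU misses.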
Related papers
- Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data [9.67203800171351]
In many settings, in-domain monolingual target-side corpora are often available.
This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language.
In experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting.
We then showcase our method on a real-world set-up, where the target monolingual resources far exceed the amount of parallel data.
arXiv Detail & Related papers (2025-04-30T15:41:03Z)
- Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models [4.574830585715128]
This paper introduces an advanced methodology for machine translation (MT) corpus generation. It integrates semi-automated, human-in-the-loop post-editing with large language models (LLMs) to enhance efficiency and translation quality.
arXiv Detail & Related papers (2025-02-18T11:16:38Z)
- Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST) [19.91873751674613]
GIST is a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms are translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. This work aims to address critical gaps in AI terminology resources and to foster global inclusivity and collaboration in AI research.
arXiv Detail & Related papers (2024-12-24T11:50:18Z)
- Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust [0.0]
This work focuses on the development of a multilingual non-profit IR system for the Islamic domain.
By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared.
arXiv Detail & Related papers (2024-11-09T11:37:18Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts [56.7988577327046]
We introduce TransAgents, a novel multi-agent framework that simulates the roles and collaborative practices of a human translation company.
Our findings highlight the potential of multi-agent collaboration in enhancing translation quality, particularly for longer texts.
arXiv Detail & Related papers (2024-05-20T05:55:08Z)
- Cross-lingual neural fuzzy matching for exploiting target-language monolingual corpora in computer-aided translation [0.0]
In this paper, we introduce a novel neural approach aimed at exploiting in-domain target-language (TL) monolingual corpora.
Our approach relies on cross-lingual sentence embeddings to retrieve translation proposals from TL monolingual corpora, and on a neural model to estimate their post-editing effort.
The paper presents an automatic evaluation of these techniques on four language pairs that shows that our approach can successfully exploit monolingual texts in a TM-based CAT environment.
arXiv Detail & Related papers (2024-01-16T14:00:28Z)
- Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z)
- Rethinking Round-Trip Translation for Machine Translation Evaluation [44.83568796515321]
We report the surprising finding that round-trip translation can be used for automatic evaluation without the references.
We demonstrate that this rectification is overdue, as round-trip translation can benefit multiple machine translation evaluation tasks.
arXiv Detail & Related papers (2022-09-15T15:06:20Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low- and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.