Machine Translation for Accessible Multi-Language Text Analysis
- URL: http://arxiv.org/abs/2301.08416v1
- Date: Fri, 20 Jan 2023 04:11:38 GMT
- Title: Machine Translation for Accessible Multi-Language Text Analysis
- Authors: Edward W. Chew, William D. Weisman, Jingying Huang, Seth Frey
- Abstract summary: We show that English-trained measures computed after translation to English have adequate-to-excellent accuracy.
We show this for three major analytics -- sentiment analysis, topic analysis, and word embeddings -- over 16 languages.
- Score: 1.5484595752241124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: English is the international standard of social research, but scholars are
increasingly conscious of their responsibility to meet the need for scholarly
insight into communication processes globally. This tension is as true in
computational methods as any other area, with revolutionary advances in the
tools for English language texts leaving most other languages far behind. In
this paper, we aim to leverage those very advances to demonstrate that
multi-language analysis is currently accessible to all computational scholars.
We show that English-trained measures computed after translation to English
have adequate-to-excellent accuracy compared to source-language measures
computed on original texts. We show this for three major analytics -- sentiment
analysis, topic analysis, and word embeddings -- over 16 languages, including
Spanish, Chinese, Hindi, and Arabic. We validate this claim by comparing
predictions on original language tweets and their backtranslations: double
translations from their source language to English and back to the source
language. Overall, our results suggest that Google Translate, a simple and
widely accessible tool, is effective in preserving semantic content across
languages and methods. Modern machine translation can thus help computational
scholars make more inclusive and general claims about human communication.
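As a concrete illustration of the validation step described above, the sketch below compares a source-language measure on original texts against the same measure on their backtranslations. This is a minimal sketch of the logic, not the authors' released code: the `translate` and `measure` callables are hypothetical placeholders for a translation service (e.g., Google Translate) and an analytic such as a sentiment scorer.

```python
from scipy.stats import pearsonr


def backtranslate(text, src_lang, translate):
    """Round-trip a text: source language -> English -> source language."""
    english = translate(text, src=src_lang, dest="en")
    return translate(english, src="en", dest=src_lang)


def backtranslation_agreement(texts, src_lang, measure, translate):
    """Correlate a measure computed on original texts with the same measure
    computed on their backtranslations.

    texts:     iterable of source-language documents (e.g., tweets)
    measure:   callable mapping a text to a numeric score (e.g., sentiment)
    translate: hypothetical callable(text, src=..., dest=...) -> translated text
    """
    original = [measure(t) for t in texts]
    roundtrip = [measure(backtranslate(t, src_lang, translate)) for t in texts]
    r, _ = pearsonr(original, roundtrip)
    return r
```

High agreement between the two sets of scores is the kind of evidence the paper uses to argue that translation preserves the semantic content these measures rely on.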
Related papers
- Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus [0.9051256541674136]
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus.
It is designed to bridge the technological gap in language learning and machine translation for under-resourced languages.
arXiv Detail & Related papers (2024-07-06T21:23:20Z)
- Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English [0.0]
This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation.
By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages.
arXiv Detail & Related papers (2024-05-05T10:52:09Z)
- Massively Multilingual Text Translation For Low-Resource Languages [7.3595126380784235]
In humanitarian efforts, translation into severely low-resource languages often does not require a universal translation engine.
While no generic translation engine covers all languages, translating a limited set of texts that are already available in many languages into a new, low-resource language may be possible.
arXiv Detail & Related papers (2024-01-29T21:33:08Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- On the Influence of Machine Translation on Language Origin Obfuscation [0.3437656066916039]
We analyze the ability to detect the source language from the translated output of two widely used commercial machine translation systems.
Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text.
arXiv Detail & Related papers (2021-06-24T08:33:24Z)
- Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose a multilingual transformer model that is pre-trained on English tweets and adapted to non-English languages through data augmentation with automatic translation.
Our experiments in French, Spanish, German, and Italian suggest that this technique is an effective way to improve transformer performance on small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z)
- On Learning Language-Invariant Representations for Universal Machine Translation [33.40094622605891]
Universal machine translation aims to learn to translate between any pair of languages.
We prove certain impossibility results for this endeavour in general, and prove positive results in the presence of additional (but natural) structure in the data.
We believe our theoretical insights and implications contribute to the future algorithmic design of universal machine translation.
arXiv Detail & Related papers (2020-08-11T04:45:33Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)