Preparation of Sentiment tagged Parallel Corpus and Testing its effect
on Machine Translation
- URL: http://arxiv.org/abs/2007.14074v1
- Date: Tue, 28 Jul 2020 09:04:47 GMT
- Title: Preparation of Sentiment tagged Parallel Corpus and Testing its effect
on Machine Translation
- Authors: Sainik Kumar Mahata, Amrita Chandra, Dipankar Das, Sivaji
Bandyopadhyay
- Abstract summary: The paper discusses the preparation of a sentiment-tagged English-Bengali parallel corpus.
The output of the translation model has been compared with a baseline translation model using automated metrics such as BLEU and TER.
- Score: 12.447116722795899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the current work, we explore the enrichment of the machine translation
output when the training parallel corpus is augmented through the introduction of
sentiment analysis. The paper discusses the preparation of the sentiment-tagged
English-Bengali parallel corpus. The preparation of the raw parallel corpus, the
sentiment analysis of its sentences, and the training of a Character-Based
Neural Machine Translation model on this data are discussed extensively
in this paper. The output of the translation model has been compared with a
baseline translation model both manually and using automated metrics such as
BLEU and TER.
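The corpus preparation described above can be sketched as prepending a sentiment label token to each English source sentence so the translation model can condition on it. This is a minimal illustration: the tag tokens (`<pos>`/`<neg>`/`<neu>`) and the tiny lexicon-based scorer are assumptions for the sketch, not the paper's actual sentiment-analysis pipeline.

```python
# Sketch: tag each source sentence of a parallel corpus with a
# sentiment token. The lexicons below are placeholders; the paper
# uses a proper sentiment analysis step instead.
POSITIVE = {"good", "great", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "sad", "poor"}

def sentiment_tag(sentence):
    # Count lexicon hits and emit the majority label.
    words = set(sentence.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "<pos>"
    if neg > pos:
        return "<neg>"
    return "<neu>"

def tag_corpus(pairs):
    # pairs: list of (english_source, bengali_target) sentence pairs.
    # The tag is prepended to the source side only.
    return [(f"{sentiment_tag(src)} {src}", tgt) for src, tgt in pairs]
```

The tagged source side can then be fed to any sequence-to-sequence trainer unchanged, since the tag behaves like an ordinary leading token.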
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ points across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- Extended Parallel Corpus for Amharic-English Machine Translation [0.0]
The corpus will be useful for machine translation of Amharic, an under-resourced language.
We trained neural machine translation and phrase-based statistical machine translation models using the corpus.
arXiv Detail & Related papers (2021-04-08T06:51:08Z)
- Designing the Business Conversation Corpus [20.491255702901288]
We aim to boost the machine translation quality of conversational texts by introducing a newly constructed Japanese-English business conversation parallel corpus.
A detailed analysis of the corpus is provided along with challenging examples for automatic translation.
We also experiment with adding the corpus in a machine translation training scenario and show how the resulting system benefits from its use.
arXiv Detail & Related papers (2020-08-05T05:19:44Z)
- Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
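The filtering idea above can be sketched as scoring each candidate pair and discarding those that look noisy. This is a hedged illustration: the `score_fn` parameter, the `min_score` threshold, and the length-ratio heuristic are assumptions standing in for the paper's pre-trained language-model scorer, not its actual method.

```python
def filter_pairs(pairs, score_fn, min_score=0.5, max_len_ratio=2.0):
    """Keep sentence pairs whose adequacy score (supplied by score_fn,
    a stand-in for a pre-trained LM scorer) clears a threshold and
    whose source/target length ratio is plausible."""
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if ls == 0 or lt == 0:
            continue  # drop empty sides outright
        ratio = max(ls, lt) / min(ls, lt)
        if ratio <= max_len_ratio and score_fn(src, tgt) >= min_score:
            kept.append((src, tgt))
    return kept
```

In practice `score_fn` would wrap a cross-lingual model's similarity or perplexity score; any callable taking the two sentences and returning a float fits the interface.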
arXiv Detail & Related papers (2020-05-13T06:06:23Z)
- Contextual Neural Machine Translation Improves Translation of Cataphoric Pronouns [50.245845110446496]
We investigate the effect of future sentences as context by comparing the performance of a contextual NMT model trained with the future context to the one trained with the past context.
Our experiments and evaluation, using generic and pronoun-focused automatic metrics, show that the use of future context achieves significant improvements over the context-agnostic Transformer.
arXiv Detail & Related papers (2020-04-21T10:45:48Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations.
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
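The alignment step described above can be sketched as computing cosine similarity between sentence vectors and greedily pairing each source sentence with its most similar target. As a hedge: the bag-of-words `embed` function below is a self-contained stand-in for the continuous-space sentence representations the paper actually uses, and the `threshold` value is an illustrative assumption.

```python
import math
from collections import Counter

def embed(sentence):
    # Stand-in for a learned sentence encoder: a bag-of-words count
    # vector. The paper uses continuous-space representations instead.
    return Counter(sentence.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(src_sents, tgt_sents, threshold=0.5):
    # Greedy one-to-one alignment: pair each source sentence with its
    # most similar unused target sentence if similarity beats threshold.
    pairs, used = [], set()
    for s in src_sents:
        best, best_sim = None, threshold
        for j, t in enumerate(tgt_sents):
            if j in used:
                continue
            sim = cosine(embed(s), embed(t))
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None:
            pairs.append((s, tgt_sents[best]))
            used.add(best)
    return pairs
```

Swapping `embed` for a real encoder (and `cosine` over its dense vectors) recovers the shape of the mining approach without changing the alignment loop.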
arXiv Detail & Related papers (2019-12-26T01:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.