KpopMT: Translation Dataset with Terminology for Kpop Fandom
- URL: http://arxiv.org/abs/2407.07413v1
- Date: Wed, 10 Jul 2024 07:14:51 GMT
- Title: KpopMT: Translation Dataset with Terminology for Kpop Fandom
- Authors: JiWoo Kim, Yunsu Kim, JinYeong Bak
- Abstract summary: Expert translators provide 1k English translations for Korean posts and comments.
We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases.
- Score: 5.464669506214195
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This leads humans to form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose the KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initiative for social groups given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups' language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available.
Related papers
- A Curious Class of Adpositional Multiword Expressions in Korean [10.449742937121014]
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME.
In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs).
arXiv Detail & Related papers (2026-02-17T21:23:16Z)
- Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time.
We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely.
Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z)
- Difficult for Whom? A Study of Japanese Lexical Complexity [12.038720850970213]
We show that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation.
By another reannotation we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary.
We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult.
arXiv Detail & Related papers (2024-10-24T09:18:53Z)
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers [7.275938266030414]
Syntactic elements, such as word order and case markers, are fundamental in natural language processing.
This study explores whether Korean language models can accurately capture this flexibility.
arXiv Detail & Related papers (2024-07-12T11:33:41Z)
- K-pop Lyric Translation: Dataset, Analysis, and Neural-Modelling [7.819710421921816]
We introduce a novel singable lyric translation dataset, approximately 89% of which consists of K-pop song lyrics.
This dataset aligns Korean and English lyrics line-by-line and section-by-section.
We construct a neural lyric translation model, thereby underscoring the importance of a dedicated dataset for singable lyric translations.
arXiv Detail & Related papers (2023-09-20T06:54:55Z)
- K-UniMorph: Korean Universal Morphology and its Feature Schema [1.3048920509133806]
We present a new Universal Morphology dataset for Korean.
We outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata.
We carry out the inflection task using three different Korean word forms: letters, syllables and morphemes.
arXiv Detail & Related papers (2023-05-10T17:44:01Z)
- "I'm" Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets [13.32560004325655]
Crowdsourcing initiatives have focused on multilingual translation of big, open data sets for use in natural language processing (NLP).
We focus on the case of pronouns translated between English and Japanese in the crowdsourced Tatoeba database.
We found that masculine pronoun biases were present overall, even though plurality in language was accounted for in other ways.
arXiv Detail & Related papers (2023-04-22T09:27:32Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language, Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z)
- The Paradigm Discovery Problem [121.79963594279893]
We formalize the paradigm discovery problem and develop metrics for judging systems.
We report empirical results on five diverse languages.
Our code and data are available for public use.
arXiv Detail & Related papers (2020-05-04T16:38:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.