JWSign: A Highly Multilingual Corpus of Bible Translations for more
Diversity in Sign Language Processing
- URL: http://arxiv.org/abs/2311.10174v1
- Date: Thu, 16 Nov 2023 20:02:44 GMT
- Title: JWSign: A Highly Multilingual Corpus of Bible Translations for more
Diversity in Sign Language Processing
- Authors: Shester Gueuwou, Sophie Siake, Colin Leong and Mathias Müller
- Abstract summary: The JWSign dataset consists of 2,530 hours of Bible translations in 98 sign languages.
We train multilingual systems, including some that take into account the typological relatedness of signed or spoken languages.
- Score: 2.9936326613596775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in sign language processing have been hindered by a lack of
sufficient data, impeding progress in recognition, translation, and production
tasks. The absence of comprehensive sign language datasets across the world's
sign languages has widened the gap in this field: a few sign languages are
studied far more than others, leaving the research area heavily skewed towards
sign languages from high-income countries. In this work, we introduce a new,
large and highly multilingual dataset for sign language
translation: JWSign. The dataset consists of 2,530 hours of Bible translations
in 98 sign languages, featuring more than 1,500 individual signers. On this
dataset, we report neural machine translation experiments. Apart from bilingual
baseline systems, we also train multilingual systems, including some that take
into account the typological relatedness of signed or spoken languages. Our
experiments highlight that multilingual systems are superior to bilingual
baselines, and that in higher-resource scenarios, clustering language pairs
that are related improves translation quality.
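To make the clustering idea above concrete, here is a minimal sketch of grouping target languages by typological relatedness and then training one multilingual model per group. The language codes, feature vectors, and cluster count are hypothetical placeholders (assuming URIEL/lang2vec-style typological features), not the paper's actual setup.
```python
# Hedged sketch: cluster languages by typological features, then train
# one multilingual NMT model per cluster of related languages.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical binary typological feature vectors for spoken target languages
# (placeholders; in practice these might come from a resource such as URIEL).
typology = {
    "en": np.array([1, 0, 1, 0, 1], dtype=float),
    "de": np.array([1, 0, 1, 1, 1], dtype=float),
    "fr": np.array([1, 1, 0, 0, 1], dtype=float),
    "sw": np.array([0, 1, 0, 1, 0], dtype=float),
}

langs = sorted(typology)
X = np.stack([typology[lang] for lang in langs])

# Agglomerative clustering over the typological vectors; each resulting
# cluster of related languages would share one multilingual translation model.
cluster_ids = AgglomerativeClustering(n_clusters=2).fit_predict(X)

for cid in sorted(set(cluster_ids)):
    members = [lang for lang, c in zip(langs, cluster_ids) if c == cid]
    print(f"cluster {cid}: train one multilingual model for pairs into {members}")
```
According to the abstract, grouping related language pairs in this spirit improves translation quality in higher-resource scenarios.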
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- ISLTranslate: Dataset for Translating Indian Sign Language [4.836352379142503]
This paper introduces ISLTranslate, a translation dataset for continuous Indian Sign Language (ISL) consisting of 31k ISL-English sentence/phrase pairs.
To the best of our knowledge, it is the largest translation dataset for continuous Indian Sign Language.
arXiv Detail & Related papers (2023-07-11T17:06:52Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Machine Translation between Spoken Languages and Signed Languages Represented in SignWriting [5.17427644066658]
We introduce novel methods to parse, factorize, decode, and evaluate SignWriting, leveraging ideas from neural factored MT.
We find that common MT techniques used to improve spoken language translation similarly affect the performance of sign language translation.
arXiv Detail & Related papers (2022-10-11T12:28:06Z)
- How Do Multilingual Encoders Learn Cross-lingual Representation? [8.409283426564977]
Cross-lingual transfer benefits languages with little to no training data by transferring from other languages.
This thesis first shows such surprising cross-lingual effectiveness compared against prior art on various tasks.
We also look at how to inject different cross-lingual signals into multilingual encoders, and the optimization behavior of cross-lingual transfer with these models.
arXiv Detail & Related papers (2022-07-12T17:57:05Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide.
Current cross-lingual algorithms have shown success in text-based and speech-related tasks for some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Alternative Input Signals Ease Transfer in Multilingual Machine Translation [21.088829932208945]
We tackle inhibited transfer by augmenting the training data with alternative signals that unify different writing systems.
We test these signals on Indic and Turkic languages, two language families where the writing systems differ but languages still share common features.
arXiv Detail & Related papers (2021-10-15T01:56:46Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- MuRIL: Multilingual Representations for Indian Languages [3.529875637780551]
India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country.
Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages.
We propose MuRIL, a multilingual language model specifically built for IN languages.
arXiv Detail & Related papers (2021-03-19T11:06:37Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)