IruMozhi: Automatically classifying diglossia in Tamil
- URL: http://arxiv.org/abs/2311.07804v1
- Date: Mon, 13 Nov 2023 23:36:35 GMT
- Title: IruMozhi: Automatically classifying diglossia in Tamil
- Authors: Kabilan Prasanna and Aryaman Arora
- Abstract summary: Spoken Tamil is under-supported in modern NLP systems.
We release IruMozhi, a human-annotated dataset of parallel text in Literary and Spoken Tamil.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tamil, a Dravidian language of South Asia, is a highly diglossic language
with two very different registers in everyday use: Literary Tamil (preferred in
writing and formal communication) and Spoken Tamil (confined to speech and
informal media). Spoken Tamil is under-supported in modern NLP systems. In this
paper, we release IruMozhi, a human-annotated dataset of parallel text in
Literary and Spoken Tamil. We train classifiers on the task of identifying
which variety a text belongs to. We use these models to gauge the availability
of pretraining data in Spoken Tamil, to audit the composition of existing
labelled datasets for Tamil, and to encourage future work on the variety.
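As a rough illustration of the classification task described in the abstract, below is a minimal sketch of a Literary-vs-Spoken text classifier built from character n-grams with scikit-learn. The transliterated example sentences are invented placeholders, not data from IruMozhi, and the authors' actual models may differ substantially.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented transliterated placeholder sentences (NOT from the IruMozhi dataset).
train_texts = [
    "vantu kondirukkiraan",   # literary-style verb form (illustrative)
    "varugiraan ippozhuthu",  # literary-style (illustrative)
    "vantukittrukkaan",       # spoken-style contraction (illustrative)
    "varraan ippo",           # spoken-style (illustrative)
]
train_labels = ["literary", "literary", "spoken", "spoken"]

# Character n-grams can capture the morphological contractions that
# distinguish the two registers, without needing word-level tokenization.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

pred = clf.predict(["varraan"])[0]
```

A classifier of this kind could then be run over a web-scale corpus to estimate how much Spoken Tamil pretraining data exists, as the abstract describes.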
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in Under-resourced Languages [0.0]
This paper describes our homophobia/transphobia in social media comments detection system developed as part of the shared task at LT-EDI-2024.
We took a transformer-based approach to develop our multiclass classification model for ten language conditions.
We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language.
arXiv Detail & Related papers (2024-01-28T21:58:04Z)
- Morphology and Syntax of the Tamil Language [0.0]
The paper highlights the complexity and richness of Tamil in terms of its morphological and syntactic features.
A rule-based morphological analyser cum generator and a computational grammar for Tamil have already been developed based on the analysis presented in this paper.
arXiv Detail & Related papers (2024-01-16T13:52:25Z)
- Tamil-Llama: A New Tamil Language Model Based on Llama 2 [6.449795539095749]
This paper enhances the open-source LLaMA model by adding 16,000 Tamil tokens to its vocabulary, aiming to achieve superior text generation and comprehension in the Tamil language.
We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness.
Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of Large Language Models in Indian languages.
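The LoRA methodology mentioned above factorizes weight updates into a pair of low-rank matrices added to a frozen pretrained weight. The following NumPy sketch illustrates only the update rule, with toy dimensions, and is not the Tamil-Llama training code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 16  # toy dimensions; real models use d, k in the thousands

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # initialized to zero so training starts at W

def lora_forward(x):
    # LoRA adds a low-rank correction (alpha / r) * B @ A to the frozen weight.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, k))
# With B = 0, the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because only A and B (with r much smaller than d and k) are trained, the number of trainable parameters drops sharply, which is what makes fine-tuning on a large Tamil corpus computationally feasible.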
arXiv Detail & Related papers (2023-11-10T03:02:39Z)
- Data and knowledge-driven approaches for multilingual training to improve the performance of speech recognition systems of Indian languages [0.0]
We propose data and knowledge-driven approaches for multilingual training of the automated speech recognition system for a target language.
In phone/senone mapping, a deep neural network (DNN) learns to map senones or phones from one language to another.
In the other approach, we model the acoustic information for all the languages simultaneously.
arXiv Detail & Related papers (2022-01-24T07:17:17Z)
- "A Passage to India": Pre-trained Word Embeddings for Indian Languages [30.607474624873014]
We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We collect the embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
arXiv Detail & Related papers (2021-12-27T17:31:04Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are variants of the same lexical form across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
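One simple way to use cross-lingual embeddings for cognate detection, in the spirit of the summary above, is to threshold the cosine similarity of a candidate word pair's vectors. The vectors, word pair, and threshold below are invented for illustration and are not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy vectors standing in for cross-lingual embeddings of word pairs:
# a cognate-like pair (Hindi "pustak" / Tamil "puththakam", both 'book')
# and an unrelated pair (Hindi "pustak" / Tamil "veedu", 'house').
pairs = {
    ("pustak", "puththakam"): (np.array([0.9, 0.1, 0.2]), np.array([0.85, 0.15, 0.25])),
    ("pustak", "veedu"):      (np.array([0.9, 0.1, 0.2]), np.array([0.1, 0.8, 0.3])),
}

THRESHOLD = 0.9  # hypothetical decision threshold
predictions = {pair: cosine(u, v) >= THRESHOLD for pair, (u, v) in pairs.items()}
```

Real systems learn this decision boundary rather than hand-picking a threshold, but the core signal, nearness in a shared embedding space, is the same.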
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.