Towards automatic identification of linguistic politeness in Hindi texts
- URL: http://arxiv.org/abs/2111.15268v1
- Date: Tue, 30 Nov 2021 10:32:17 GMT
- Title: Towards automatic identification of linguistic politeness in Hindi texts
- Authors: Ritesh Kumar
- Abstract summary: I used a manually annotated corpus of over 25,000 blog comments to train an SVM.
The trained system achieves an accuracy of over 77%, which is within 2% of human accuracy.
- Score: 1.2691047660244332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper I present a classifier for automatic identification of
linguistic politeness in Hindi texts. I used a manually annotated corpus
of over 25,000 blog comments to train an SVM. Drawing on the discursive and
interactional approaches to politeness, the paper gives an exposition of the
normative, conventionalised politeness structures of Hindi. Using these
manually recognised structures as features in training the SVM
significantly improves the performance of the classifier on the test set. The
trained system achieves an accuracy of over 77%, which is within
2% of human accuracy.
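The abstract does not say which features, kernel, or toolkit were used, so the following is only a minimal sketch of the general idea: add a feature based on manually listed politeness markers to a bag-of-words representation and train a linear SVM on it. The marker list (`POLITE_MARKERS`), the toy comments in `TOY_DATA`, and every function name are illustrative assumptions, not the paper's features or corpus. The optimiser is a plain subgradient descent on the hinge loss, written in pure Python so the example has no dependencies.

```python
# Hypothetical marker list (assumption): please, thanks, honorific particle, formal "you".
POLITE_MARKERS = ["कृपया", "धन्यवाद", "जी", "आप"]

def featurize(comment):
    """Binary bag-of-words features plus a count of hand-listed politeness markers."""
    tokens = comment.split()
    feats = {"bias": 1.0}
    for tok in tokens:
        feats["w=" + tok] = 1.0
    feats["n_polite_markers"] = float(sum(tok in POLITE_MARKERS for tok in tokens))
    return feats

def score(weights, feats):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def train_linear_svm(data, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for feats, label in data:
            margin = label * score(weights, feats)
            # Regularisation step: shrink every weight slightly.
            for name in weights:
                weights[name] *= 1.0 - lr * lam
            # Hinge-loss step: update only when the example violates the margin.
            if margin < 1.0:
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + lr * label * value
    return weights

def predict(weights, comment):
    """Return +1 (polite) or -1 (not polite) for a whitespace-tokenised comment."""
    return 1 if score(weights, featurize(comment)) >= 0.0 else -1

# Toy, made-up examples (assumption); NOT taken from the annotated corpus.
TOY_DATA = [
    ("कृपया बैठिए", 1),
    ("आप का धन्यवाद", 1),
    ("जी हाँ आप आइए", 1),
    ("चुप बैठ", -1),
    ("तू जा यहाँ से", -1),
    ("भाग यहाँ से", -1),
]

weights = train_linear_svm([(featurize(text), label) for text, label in TOY_DATA])
print(predict(weights, "कृपया आइए जी"))
```

Keeping the marker count as a separate feature lets the classifier generalise to comments whose individual words were never seen in training, which is one plausible reason hand-identified structures could help; whether this matches the paper's setup is not stated in the abstract.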
Related papers
- Improving Informally Romanized Language Identification [49.404145019682666]
Romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu.
We increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets.
We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set.
arXiv Detail & Related papers (2025-04-30T11:36:28Z)
- Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling [0.0]
This study assesses five cutting-edge ASR systems' recognition of non-native English accented speech using recordings from the L2-ARCTIC corpus.
For read speech, Whisper and AssemblyAI achieved the best accuracy with mean Match Error Rates (MER) of 0.054 and 0.056 respectively.
For spontaneous speech, RevAI performed best with a mean MER of 0.063.
arXiv Detail & Related papers (2025-03-10T05:09:44Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate Speech in Under-resourced Languages [0.0]
This paper describes our system for detecting homophobia/transphobia in social media comments, developed as part of the shared task at LT-EDI-2024.
We took a transformer-based approach to develop our multiclass classification model for ten language conditions.
We introduced synthetic and organic instances of script-switched language data during domain adaptation to mirror the linguistic realities of social media language.
arXiv Detail & Related papers (2024-01-28T21:58:04Z)
- Cross-Lingual Speaker Identification Using Distant Supervision [84.51121411280134]
We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy and 5% with only distant supervision.
arXiv Detail & Related papers (2022-10-11T20:49:44Z)
- Handwriting recognition and automatic scoring for descriptive answers in Japanese language tests [7.489722641968594]
This paper presents an experiment in automatically scoring handwritten descriptive answers in the trial tests for the new Japanese university entrance examination.
Although all answers have been scored by human examiners, handwritten characters are not labeled.
We present our attempt to adapt deep neural network-based handwriting recognizers trained on a labeled handwriting dataset into this unlabeled answer set.
arXiv Detail & Related papers (2022-01-10T08:47:52Z)
- Prosody Labelled Dataset for Hindi using Semi-Automated Approach [0.19733467999508417]
This study aims to develop a semi-automatically labelled prosody database for Hindi.
No single standard for prosody labelling exists in Hindi.
The accuracy of the trained models for pitch accent, intermediate phrase boundaries and accentual phrase boundaries is 73.40%, 93.20%, and 43% respectively.
arXiv Detail & Related papers (2021-12-11T13:11:36Z)
- Support Vector Machine for Handwritten Character Recognition [0.0]
A database of 10,000 character samples of 44 basic Malayalam characters is used in this work.
A discriminative feature set of 64 local and 4 global features is used to train and test an SVM classifier, which achieves 92.24% accuracy.
arXiv Detail & Related papers (2021-09-07T13:36:12Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score in English Sub-task A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.