Language Identification with a Reciprocal Rank Classifier
- URL: http://arxiv.org/abs/2109.09862v1
- Date: Mon, 20 Sep 2021 22:10:07 GMT
- Title: Language Identification with a Reciprocal Rank Classifier
- Authors: Dominic Widdows and Chris Brew
- Abstract summary: We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of training data.
We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language identification is a critical component of language processing
pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world
settings. We present a lightweight and effective language identifier that is
robust to changes of domain and to the absence of copious training data.
The key idea for classification is that the reciprocal of the rank in a
frequency table makes an effective additive feature score, hence the term
Reciprocal Rank Classifier (RRC). The key finding for language classification
is that ranked lists of words and frequencies of characters form a sufficient
and robust representation of the regularities of key languages and their
orthographies.
We test this on two 22-language data sets and demonstrate zero-effort domain
adaptation from a Wikipedia training set to a Twitter test set. When trained on
Wikipedia but applied to Twitter, the macro-averaged F1-score of a
conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast,
the macro F1-score of RRC drops only from 93.1% to 90.6%. These classifiers are
compared with those from fastText and langid. The RRC performs better than
these established systems in most experiments, especially on short Wikipedia
texts and Twitter.
The RRC classifier can be improved for particular domains and conversational
situations by adding words to the ranked lists. Using new terms learned from
such conversations, we demonstrate a further 7.9% increase in accuracy of
sample message classification, and 1.7% increase for conversation
classification. Surprisingly, this made results on Twitter data slightly worse.
The RRC classifier is available as an open source Python package
(https://github.com/LivePersonInc/lplangid).
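To make the scoring rule concrete, here is a minimal sketch of reciprocal-rank classification in Python. The toy ranked word lists and function names are illustrative assumptions, not the lplangid package's actual API, and the sketch omits the character-frequency features the paper also uses.

```python
from typing import Dict, List

# Minimal sketch of reciprocal-rank scoring (not the lplangid API).
# Toy top-ranked word lists; the real classifier learns much longer
# ranked lists from Wikipedia and also uses character frequencies.
RANKED_WORDS: Dict[str, List[str]] = {
    "en": ["the", "of", "and", "in", "to"],
    "de": ["der", "die", "und", "in", "den"],
}

def rrc_score(text: str, ranked_words: List[str]) -> float:
    """Sum 1/rank over the text's words that appear in the ranked list."""
    rank = {word: i + 1 for i, word in enumerate(ranked_words)}
    return sum(1.0 / rank[w] for w in text.lower().split() if w in rank)

def classify(text: str) -> str:
    """Pick the language whose ranked list yields the highest additive score."""
    return max(RANKED_WORDS, key=lambda lang: rrc_score(text, RANKED_WORDS[lang]))

print(classify("the cat and the dog"))     # -> en
print(classify("der Hund und die Katze"))  # -> de
```

Because the model is just a ranked word list per language, adapting it to a new domain amounts to adding that domain's frequent terms to the lists, which is the mechanism behind the conversational improvements reported above.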
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Enhancing Visual Continual Learning with Language-Guided Supervision [76.38481740848434]
Continual learning aims to empower models to learn new tasks without forgetting previously acquired knowledge.
We argue that the scarce semantic information conveyed by the one-hot labels hampers the effective knowledge transfer across tasks.
Specifically, we use pre-trained language models (PLMs) to generate semantic targets for each class, which are frozen and serve as supervision signals.
arXiv Detail & Related papers (2024-03-24T12:41:58Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition [56.408324994409405]
Multiplexed routing network (MRN) trains a recognizer for each language that is currently seen.
MRN effectively reduces reliance on older data and mitigates catastrophic forgetting.
It outperforms existing general-purpose IL methods by large margins.
arXiv Detail & Related papers (2023-05-24T06:03:34Z)
- A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages [2.5874041837241304]
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose DPC, a novel detector-purificator-corrector framework based on denoising transformers, which addresses these earlier shortcomings.
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- JARVix at SemEval-2022 Task 2: It Takes One to Know One? Idiomaticity Detection using Zero and One Shot Learning [7.453634424442979]
In this paper, we focus on the detection of idiomatic expressions by using binary classification.
We use a dataset consisting of the literal and idiomatic usage of MWEs in English and Portuguese.
We train multiple Large Language Models in both settings and achieve a macro F1 score of 0.73 in the zero-shot setting and 0.85 in the one-shot setting.
arXiv Detail & Related papers (2022-02-04T21:17:41Z)
- Regular Expressions for Fast-response COVID-19 Text Classification [1.1279808969568252]
Facebook needs to determine whether a piece of text belongs to a narrow topic such as COVID-19.
We employ human-guided iterations of keyword discovery, but do not require labeled data.
Regular expressions enable low-latency queries from multiple platforms.
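A single compiled pattern is enough to illustrate the approach; the keywords below are hypothetical stand-ins, not the paper's actual expressions.

```python
import re

# Hypothetical keyword pattern for illustration only; the paper's patterns
# come from human-guided iterations of keyword discovery.
COVID_PATTERN = re.compile(
    r"\b(covid[- ]?19|coronavirus|sars[- ]cov[- ]2|lockdown|quarantine)\b",
    re.IGNORECASE,
)

def is_covid_related(text: str) -> bool:
    # One regex search per query: no labeled data, no model, low latency.
    return COVID_PATTERN.search(text) is not None
```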
arXiv Detail & Related papers (2021-02-18T17:48:49Z)
- Voice@SRIB at SemEval-2020 Task 9 and 12: Stacked Ensembling method for Sentiment and Offensiveness detection in Social Media [2.9008108937701333]
We train embeddings and ensembling methods for the Sentimix and OffensEval tasks.
We evaluate our models on macro F1-score, precision, accuracy, and recall on the datasets.
arXiv Detail & Related papers (2020-07-20T11:54:43Z)
- Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL [0.0]
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners.
Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms.
The results showed that the adopted linguistic features provide a good overall classification performance.
arXiv Detail & Related papers (2020-01-07T02:42:57Z)