Regular Expressions for Fast-response COVID-19 Text Classification
- URL: http://arxiv.org/abs/2102.09507v2
- Date: Fri, 19 Feb 2021 19:23:33 GMT
- Title: Regular Expressions for Fast-response COVID-19 Text Classification
- Authors: Igor L. Markov, Jacqueline Liu, Adam Vagner
- Abstract summary: This paper describes how Facebook determines whether a piece of text belongs to a narrow topic such as COVID-19.
We employ human-guided iterations of keyword discovery, but do not require labeled data.
Regular expressions enable low-latency queries from multiple platforms.
- Score: 1.1279808969568252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classifiers are at the core of many NLP applications and use a variety
of algorithmic approaches and software. This paper describes how Facebook
determines if a given piece of text - anything from a hashtag to a post -
belongs to a narrow topic such as COVID-19. To fully define a topic and
evaluate classifier performance we employ human-guided iterations of keyword
discovery, but do not require labeled data. For COVID-19, we build two sets of
regular expressions: (1) for 66 languages, with 99% precision and recall >50%,
(2) for the 11 most common languages, with precision >90% and recall >90%.
Regular expressions enable low-latency queries from multiple platforms.
Response to challenges like COVID-19 is fast and so are revisions. Comparisons
to a DNN classifier show explainable results, higher precision and recall, and
less overfitting. Our learnings can be applied to other narrow-topic
classifiers.
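As a rough illustration of the approach the abstract describes, the sketch below matches text against one precompiled regular expression per language. The patterns, the `COVID_PATTERNS` table, and the `is_covid_text` helper are hypothetical names and values chosen for this example; they are not the expressions used in the paper.

```python
import re

# Hypothetical keyword patterns, for illustration only.
# The paper's actual regular expressions are not reproduced here.
COVID_PATTERNS = {
    "en": r"\b(?:covid(?:[-\s]?19)?|coronavirus|sars[-\s]?cov[-\s]?2)\b",
    "es": r"\b(?:covid(?:[-\s]?19)?|coronavirus|cuarentena)\b",
}

# Compile once so that each query is a single low-latency regex search.
COMPILED = {
    lang: re.compile(pattern, re.IGNORECASE)
    for lang, pattern in COVID_PATTERNS.items()
}


def is_covid_text(text, lang="en"):
    """Return True if `text` matches the keyword pattern for `lang`.

    Languages without a pattern are treated as non-matching.
    """
    pattern = COMPILED.get(lang)
    if pattern is None:
        return False
    return pattern.search(text) is not None
```

Because a match is just a keyword hit, any positive result can be explained by pointing at the matched substring, and revising the classifier amounts to editing a pattern.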
Related papers
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the words forming the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- JARVix at SemEval-2022 Task 2: It Takes One to Know One? Idiomaticity Detection using Zero and One Shot Learning [7.453634424442979]
In this paper, we focus on the detection of idiomatic expressions by using binary classification.
We use a dataset consisting of the literal and idiomatic usage of MWEs in English and Portuguese.
We train multiple Large Language Models in both settings and achieve a macro F1 score of 0.73 in the zero-shot setting and 0.85 in the one-shot setting.
arXiv Detail & Related papers (2022-02-04T21:17:41Z)
- Language Identification with a Reciprocal Rank Classifier [1.4467794332678539]
We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of training data.
We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set.
arXiv Detail & Related papers (2021-09-20T22:10:07Z)
- Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified.
The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z)
- Evaluating Various Tokenizers for Arabic Text Classification [4.110108749051656]
We introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations.
Our experiments show that the performance of such tokenization algorithms depends on the size of the dataset, type of the task, and the amount of morphology that exists in the dataset.
arXiv Detail & Related papers (2021-06-14T16:05:58Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made on inputs tokenized by the standard and probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Novel Keyword Extraction and Language Detection Approaches [0.6445605125467573]
We propose a fast novel approach to string tokenisation for fuzzy language matching.
We experimentally demonstrate an 83.6% decrease in processing time.
We find the Accept-Language header is 14% more likely to match the classification than the IP Address.
arXiv Detail & Related papers (2020-09-24T17:28:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.