Novel Keyword Extraction and Language Detection Approaches
- URL: http://arxiv.org/abs/2009.11832v1
- Date: Thu, 24 Sep 2020 17:28:59 GMT
- Title: Novel Keyword Extraction and Language Detection Approaches
- Authors: Malgorzata Pikies, Andronicus Riyono, Junade Ali
- Abstract summary: We propose a fast, novel approach to string tokenisation for fuzzy language matching.
We experimentally demonstrate an 83.6% decrease in processing time.
We find the Accept-Language header is 14% more likely to match the classification than the IP address.
- Score: 0.6445605125467573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fuzzy string matching and language classification are important tools in
Natural Language Processing pipelines; this paper provides advances in both
areas. We propose a fast, novel approach to string tokenisation for fuzzy
language matching and experimentally demonstrate an 83.6% decrease in
processing time, with an estimated improvement in recall of 3.1% at the cost
of a 2.6% decrease in precision. The approach works even where keywords are
subdivided into multiple words, without needing to scan character by
character. So far there has been little work on using metadata to enhance
language classification algorithms. We provide observational data and find
that the Accept-Language header is 14% more likely to match the
classification than the IP address.
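As an illustration of the token-level idea, here is a minimal sketch (our own construction, not the authors' published algorithm; the tokeniser, threshold, and slack window are assumptions): tokenise once, then compare joined token windows against the keyword rather than scanning the haystack character by character.

```python
# Minimal sketch of token-level fuzzy keyword matching -- not the paper's
# exact algorithm. Tokenise once, then compare joined token windows against
# the keyword instead of scanning character by character.
from difflib import SequenceMatcher

def tokenise(text: str) -> list[str]:
    # Simplistic stand-in tokeniser: lowercase, split on non-alphanumerics.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def contains_keyword(text: str, keyword: str,
                     threshold: float = 0.8, slack: int = 1) -> bool:
    tokens = tokenise(text)
    kw_tokens = tokenise(keyword)
    target = "".join(kw_tokens)              # keyword with spaces removed
    k = len(kw_tokens)
    # Windows of k-slack .. k+slack tokens let a keyword the text has
    # subdivided ("pass word") or merged still line up with the target.
    for size in range(max(1, k - slack), k + slack + 1):
        for i in range(len(tokens) - size + 1):
            window = "".join(tokens[i:i + size])
            if SequenceMatcher(None, window, target).ratio() >= threshold:
                return True
    return False

print(contains_keyword("forgot my pass word", "password"))             # True
print(contains_keyword("please reset my password", "reset password"))  # True
print(contains_keyword("hello world", "password"))                     # False
```

Moving the unit of comparison from characters to tokens is, in spirit, where a speed-up of the reported kind would come from; the paper's own tokeniser and matching rules differ in detail.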
Related papers
- CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation [39.08623113730563]
Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks.
We propose a new approach utilizing a character-level Transformer model for Sanskrit Word Segmentation (CharSS).
We perform experiments on three benchmark datasets to compare the performance of our method against existing methods.
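A segmenter of this general shape can be sketched as per-character boundary tagging (our own minimal construction for illustration, not the CharSS architecture):

```python
# Word segmentation as per-character binary tagging with a Transformer
# encoder: label 1 means "a word boundary follows this character".
# Our illustrative construction, not CharSS; positional encodings omitted.
import torch
import torch.nn as nn

class CharSegmenter(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128,
                 nhead: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d_model, 2)        # boundary / no boundary

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(self.embed(char_ids)))

model = CharSegmenter(vocab_size=256)            # e.g. byte-level characters
logits = model(torch.randint(0, 256, (1, 16)))   # 1 string of 16 characters
print(logits.shape)                              # torch.Size([1, 16, 2])
```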
arXiv Detail & Related papers (2024-07-08T18:50:13Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
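The contrastive objective behind such retrievers can be sketched as an InfoNCE loss with in-batch negatives (a generic illustration; the encoder, batch construction, and temperature here are assumptions, not CCPR's exact recipe):

```python
# InfoNCE-style contrastive loss with in-batch negatives, as commonly used
# to train dense retrievers. Random tensors stand in for phrase encodings.
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, pos_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """Row i of pos_emb is the aligned cross-lingual phrase for query i;
    every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature         # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```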
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
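The classic pipeline is easy to sketch with off-the-shelf checkpoints (the model names below are illustrative stand-ins, not the models trained in T3L):

```python
# Translate-and-test: translate into English, then classify the translation.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
classify = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

def classify_cross_lingual(text_de: str) -> dict:
    english = translate(text_de)[0]["translation_text"]  # stage 1: translate
    return classify(english)[0]                          # stage 2: classify

print(classify_cross_lingual("Dieser Film war wirklich großartig."))
# e.g. {'label': 'POSITIVE', 'score': 0.99...}
```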
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
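For illustration, a WordPiece tokenizer can be trained from raw, unlabeled text with the Hugging Face `tokenizers` library, and its subwords used in place of whitespace tokens in a lexical retriever such as BM25 (a sketch; the corpus and vocabulary size are toy values, and the BM25 indexing step is left out):

```python
# Train a WordPiece tokenizer from unsupervised data; no custom rules needed.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["example sentences in any language, no hand-written rules needed",
          "subword units are learned directly from the raw data"]

tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=200, special_tokens=["[UNK]"])
tok.train_from_iterator(corpus, trainer)        # toy corpus, toy vocab size

print(tok.encode("unwritten rules").tokens)     # learned subword pieces
```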
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
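As a hedged stand-in for that setup (character n-gram TF-IDF features replace the paper's probabilistic embeddings, feeding a linear SVM like one of the paper's classifiers), a lightweight code-mixed classifier looks like this:

```python
# Toy code-mixed sentiment classifier: char n-grams are robust to the
# spelling variation typical of code-mixed text. Data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["yeh movie bohot achhi thi", "this film was terrible yaar",
         "kya bakwas film hai", "loved it, ekdum mast"]
labels = ["pos", "neg", "neg", "pos"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["achhi movie thi yaar"]))    # expected ['pos'] on toy data
```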
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
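A simplified way to realise intrinsic probing (our illustration, not the paper's exact method) is to score individual dimensions against a linguistic label and then probe only the selected subset:

```python
# Dimension selection for intrinsic probing: rank dimensions by mutual
# information with a label, then probe the top few. Data is synthetic.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
reps = rng.normal(size=(500, 64))       # stand-in contextual representations
labels = (reps[:, 7] + reps[:, 21] > 0).astype(int)  # signal in dims 7, 21

mi = mutual_info_classif(reps, labels, random_state=0)
top = np.argsort(mi)[-2:]               # dimensions carrying the most signal
print(sorted(int(i) for i in top))      # likely [7, 21]

probe = LogisticRegression().fit(reps[:, top], labels)
print(probe.score(reps[:, top], labels))  # accuracy of the intrinsic probe
```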
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
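Approach (ii) is the simplest to sketch: standardise embeddings per language so language-specific means and variances disappear (a toy illustration on synthetic data):

```python
# Remove each language's mean and variance so embeddings share one space.
import numpy as np

def per_language_standardise(embs: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    out = {}
    for lang, X in embs.items():
        mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
        out[lang] = (X - mu) / sigma    # language-specific stats removed
    return out

rng = np.random.default_rng(0)
embs = {"en": rng.normal(0.5, 1.0, (100, 32)),    # toy embeddings with
        "de": rng.normal(-0.3, 2.0, (100, 32))}   # language-specific offsets
for lang, X in per_language_standardise(embs).items():
    print(lang, f"mean={X.mean():.3f}", f"std={X.std():.3f}")  # ~0 and ~1
```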
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance [8.395430195053061]
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other.
We develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages.
These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs.
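A relaxed variant of such a distance can be sketched as averaged nearest-neighbour transport between the two documents' sentence embeddings (our simplification; the paper's scoring function may differ in detail):

```python
# Relaxed sentence-mover's distance over L2-normalised sentence embeddings.
import numpy as np

def relaxed_smd(doc_a: np.ndarray, doc_b: np.ndarray) -> float:
    """doc_a: (n, d), doc_b: (m, d); rows are sentence embeddings from a
    shared multilingual encoder, L2-normalised."""
    cost = 1.0 - doc_a @ doc_b.T               # cosine distance matrix
    a_to_b = cost.min(axis=1).mean()           # each sentence moves to its
    b_to_a = cost.min(axis=0).mean()           # nearest counterpart
    return (a_to_b + b_to_a) / 2

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
a, b = norm(rng.normal(size=(5, 128))), norm(rng.normal(size=(7, 128)))
print(round(relaxed_smd(a, b), 3))   # lower = more likely aligned documents
```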
arXiv Detail & Related papers (2020-01-31T05:14:16Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
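One generic way to induce such insensitivity (an illustration of the idea, not necessarily the paper's technique) is to shuffle source tokens locally as a training-time augmentation:

```python
# Local word-order shuffling: destroys source word order within a window
# while keeping the content, so a model cannot overfit to the order.
import random

def shuffle_within_window(tokens: list[str], window: int = 3, seed=None):
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)              # order destroyed, content kept
        out.extend(chunk)
    return out

print(shuffle_within_window("the quick brown fox jumps over".split(), seed=1))
```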
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
- Machine Learning Approaches for Amharic Parts-of-speech Tagging [0.0]
Current POS taggers for Amharic do not perform as well as contemporary taggers available for English and other European languages.
The aim of this work is to improve POS tagging performance for Amharic, which has never exceeded 91%.
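A generic feature-based ML tagger of the kind such work compares can be sketched in a few lines (toy English tokens stand in for Amharic data here):

```python
# Minimal feature-based POS tagger: per-token features, linear classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(sent, i):
    w = sent[i]
    return {"word": w, "prefix2": w[:2], "suffix2": w[-2:],
            "prev": sent[i - 1] if i else "<s>",
            "next": sent[i + 1] if i < len(sent) - 1 else "</s>"}

train = [(["the", "dog", "runs"], ["DET", "NOUN", "VERB"]),
         (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
X = [features(s, i) for s, _ in train for i in range(len(s))]
y = [t for _, tags in train for t in tags]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=200))
tagger.fit(X, y)
sent = ["the", "cat", "runs"]
print(list(zip(sent, tagger.predict([features(sent, i) for i in range(3)]))))
```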
arXiv Detail & Related papers (2020-01-10T06:40:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.