Multilingual Search with Subword TF-IDF
- URL: http://arxiv.org/abs/2209.14281v2
- Date: Thu, 29 Sep 2022 02:39:46 GMT
- Title: Multilingual Search with Subword TF-IDF
- Authors: Artit Wangperawong
- Abstract summary: Subword TF-IDF (STF-IDF) can offer higher accuracy without heuristics such as manually curated tokenization, stop words, and stemming rules.
XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual search can be achieved with subword tokenization. The accuracy
of traditional TF-IDF approaches depends on manually curated tokenization, stop
words and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher
accuracy without such heuristics. Moreover, multilingual support can be
incorporated inherently as part of the subword tokenization model training.
XQuAD evaluation demonstrates the advantages of STF-IDF: superior information
retrieval accuracy of 85.4% for English and over 80% for 10 other languages
without any heuristics-based preprocessing. The software to reproduce these
results is open-sourced as part of Text2Text:
https://github.com/artitw/text2text
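As a minimal sketch of the idea, the snippet below pairs SentencePiece subword tokenization with a standard TF-IDF vectorizer; it illustrates STF-IDF under toy assumptions (corpus size, vocabulary size) and is not the Text2Text implementation linked above.
```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Multilingual search can be achieved with subword tokenization.",
    "La recherche multilingue repose sur les sous-mots.",
    "Subword units avoid hand-crafted stemming and stop-word rules.",
    "Die mehrsprachige Suche nutzt Subwort-Einheiten.",
]

# Train a small subword model directly on the corpus; no language-specific
# tokenization, stop words, or stemming rules are supplied.
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sw",
    vocab_size=60,  # toy size chosen to fit the tiny corpus
    model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="sw.model")

# Standard TF-IDF, but over subword pieces instead of heuristic word tokens.
vectorizer = TfidfVectorizer(analyzer=sp.encode_as_pieces)
doc_vectors = vectorizer.fit_transform(corpus)

# Rank documents against a query by cosine similarity, across languages.
query = vectorizer.transform(["multilingual subword search"])
print(cosine_similarity(query, doc_vectors))
```
Because the subword vocabulary is learned from the corpus itself, the same pipeline applies unchanged to other languages, which is the multilingual property the abstract highlights.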
Related papers
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization allows words to be split into characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer on the Swahili language (an approximate syllable-splitting sketch follows below).
arXiv Detail & Related papers (2024-03-26T17:26:50Z)
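A hedged sketch of syllable tokenization in the spirit of the Swahili paper above: the consonant+vowel regex and the syllabic-nasal rule are rough approximations introduced here for illustration, not the authors' tokenizer.
```python
import re

# Swahili syllables are largely open: optional consonants followed by a
# vowel. A nasal (m, n) standing before a consonant is also syllabic.
SYLLABLE = re.compile(r"[mn](?=[^aeiou\s])|[^aeiou\s]*[aeiou]", re.IGNORECASE)

def syllable_tokenize(text: str) -> list[str]:
    """Approximate syllable split; a real tokenizer handles more cases."""
    return SYLLABLE.findall(text)

print(syllable_tokenize("watoto wanasoma vitabu"))
# ['wa', 'to', 'to', 'wa', 'na', 'so', 'ma', 'vi', 'ta', 'bu']
print(syllable_tokenize("mtu"))  # ['m', 'tu']
```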
- A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset [0.5156484100374058]
Term Frequency-Inverse Document Frequency (TF-IDF) is among the most widely used feature weighting methods for information retrieval and text classification.
We have investigated and analyzed the feature weighting method for text classification on unstructured data.
The proposed model considers two feature sets, N-grams and TF-IDF, on the IMDB movie review and Amazon Alexa review datasets for sentiment analysis (a sketch of this setup follows below).
arXiv Detail & Related papers (2023-08-08T04:27:34Z)
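A minimal sketch of the feature setup compared in the study above, assuming scikit-learn; the toy reviews stand in for the IMDB and Amazon Alexa datasets.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great movie, loved it", "terrible plot, waste of time",
           "alexa answers quickly", "device stopped working after a week"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Word unigrams and bigrams weighted by TF-IDF feed a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(reviews, labels)
print(model.predict(["loved the device"]))
```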
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages (a sketch of this pipeline follows below).
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
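A hedged sketch of the classic translate-and-test pipeline that T3L revisits, using Hugging Face pipelines; the model names are illustrative choices, not the paper's checkpoints.
```python
from transformers import pipeline

# Stage 1: translate the target language (here German) into the source
# language; Stage 2: classify with a source-language (English) model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
classifier = pipeline("sentiment-analysis")

def translate_and_test(text_de: str) -> dict:
    text_en = translator(text_de)[0]["translation_text"]
    return classifier(text_en)[0]

print(translate_and_test("Der Film war ausgezeichnet."))
```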
- Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF [0.0]
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval.
Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for finding relevant documents.
The method's effectiveness is confirmed by experiments using the Cosine, Dice, and Jaccard functions to measure the similarity of Kazakh text documents (these functions are sketched below).
arXiv Detail & Related papers (2022-11-22T15:54:41Z)
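A minimal sketch of the three similarity functions evaluated in the Kazakh study above, applied to TF-IDF weight vectors; the vector (Tanimoto-style) generalizations of Dice and Jaccard are used here, and the document vectors are toy values.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dice(a: np.ndarray, b: np.ndarray) -> float:
    return float(2 * (a @ b) / (a @ a + b @ b))

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    dot = a @ b
    return float(dot / (a @ a + b @ b - dot))

# TF-IDF weights for two documents over a shared four-term vocabulary.
d1 = np.array([0.4, 0.0, 0.7, 0.1])
d2 = np.array([0.3, 0.2, 0.6, 0.0])
print(cosine(d1, d2), dice(d1, d2), jaccard(d1, d2))
```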
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens yield topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models (a collocation-merging sketch follows below).
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
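A hedged sketch of collocation tokenization before LDA in the spirit of the paper above, using gensim's Phrases as a stand-in for the authors' merging method; the corpus and thresholds are illustrative.
```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.models.phrases import Phraser

texts = [
    ["new", "york", "subway", "delays"],
    ["new", "york", "city", "budget"],
    ["machine", "learning", "for", "topic", "models"],
    ["machine", "learning", "in", "new", "york"],
]

# Merge frequent bigrams into single tokens, e.g. "new york" -> "new_york".
bigram = Phraser(Phrases(texts, min_count=2, threshold=1.0))
merged = [bigram[t] for t in texts]

# Train LDA on the merged-token corpus.
dictionary = Dictionary(merged)
corpus = [dictionary.doc2bow(t) for t in merged]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics())
```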
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (a minimal version of such a loss is sketched below).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
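A hedged sketch in PyTorch of a KL-divergence self-teaching loss of the kind FILTER describes; tensor shapes and names are illustrative, not the paper's code.
```python
import torch
import torch.nn.functional as F

# Student logits on translated target-language text (from the model in
# real training); pseudo_logits are auto-generated soft pseudo-labels.
logits_target = torch.randn(8, 3, requires_grad=True)
pseudo_logits = torch.randn(8, 3)

# KL divergence between the soft pseudo-label distribution and the
# student's predictions, averaged over the batch.
loss = F.kl_div(
    F.log_softmax(logits_target, dim=-1),  # student log-probabilities
    F.softmax(pseudo_logits, dim=-1),      # teacher (pseudo-label) probs
    reduction="batchmean",
)
loss.backward()
print(float(loss))
```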
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering (approach (ii) is sketched below).
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
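A minimal sketch of approach (ii) above: standardizing each language's embeddings by its own mean and variance so that language-specific offsets are removed; the random arrays stand in for real sentence embeddings.
```python
import numpy as np

def standardize_per_language(emb: np.ndarray) -> np.ndarray:
    """Center and scale one language's embeddings by its own statistics."""
    mu = emb.mean(axis=0, keepdims=True)
    sigma = emb.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (emb - mu) / sigma

embeddings = {
    "en": np.random.randn(100, 16) + 0.5,  # language-specific offset
    "de": np.random.randn(100, 16) - 0.3,
}
aligned = {lang: standardize_per_language(e) for lang, e in embeddings.items()}
print({lang: round(float(e.mean()), 6) for lang, e in aligned.items()})  # ~0
```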
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
- Semantic Sensitive TF-IDF to Determine Word Relevance in Documents [0.0]
We propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance in informal documents in a corpus.
Our method decreases the TF-IDF mean error rate by 50%, reaching a mean error of 13.7%, compared with 27.2% for the original TF-IDF.
arXiv Detail & Related papers (2020-01-06T00:23:11Z)