Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
- URL: http://arxiv.org/abs/2305.02426v1
- Date: Wed, 3 May 2023 20:50:05 GMT
- Title: Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data
- Authors: Ali Mehrban, Pegah Ahadian
- Abstract summary: The paper uses the example of Divar, an online marketplace for buying and selling products and services in Iran.
It presents a competition to predict the likelihood that a car sales ad would be published on the Divar website.
Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper discusses the impact of the Internet on modern trading and the
importance of data generated from these transactions for organizations to
improve their marketing efforts. The paper uses the example of Divar, an online
marketplace for buying and selling products and services in Iran, and presents
a competition to predict the likelihood that a car sales ad would be
published on the Divar website. Since the dataset provides a rich source of
Persian text data, the authors use the Hazm library, a Python library designed
for processing Persian text, and two state-of-the-art language models, mBERT
and ParsBERT, to analyze it. The paper's primary objective is to compare the
performance of mBERT and ParsBERT on the Divar dataset. The authors provide
some background on data mining, Persian language, and the two language models,
examine the dataset's composition and statistical features, and provide details
on their fine-tuning and training configurations for both approaches. They
present the results of their analysis and highlight the strengths and
weaknesses of the two language models when applied to Persian text data. The
paper offers valuable insights into the challenges and opportunities of working
with low-resource languages such as Persian and the potential of advanced
language models like BERT for analyzing such data. The paper also explains the
data mining process, including steps such as data cleaning and normalization
techniques. Finally, the paper discusses the types of machine learning
problems, such as supervised, unsupervised, and reinforcement learning, and the
pattern evaluation techniques, such as the confusion matrix. Overall, the paper
provides an informative overview of the use of language models and data mining
techniques for analyzing text data in low-resource languages, using the example
of the Divar dataset.
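The pipeline the abstract describes (clean and normalize the Persian text, classify, then evaluate with a confusion matrix) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the paper uses the Hazm library (e.g. its Normalizer) and fine-tuned mBERT/ParsBERT models, whereas the character mappings, labels, and sample data below are hypothetical stand-ins that only mimic the kind of unification a Persian normalizer performs.

```python
# Minimal sketch of the described pipeline: normalize Persian text, then
# evaluate predicted labels with a confusion matrix. Illustrative only --
# the paper itself relies on Hazm and fine-tuned mBERT/ParsBERT models.

from collections import Counter

# Map Arabic-script variants to standard Persian forms, the kind of
# character unification a Persian text normalizer performs.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh  -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf  -> Persian kaf
    "\u0629": "\u0647",  # teh marbuta -> heh
})

def normalize(text: str) -> str:
    """Apply character unification and collapse extra whitespace."""
    return " ".join(text.translate(CHAR_MAP).split())

def confusion_matrix(y_true, y_pred):
    """Count (true_label, predicted_label) pairs for binary labels."""
    return Counter(zip(y_true, y_pred))

# Toy evaluation: 1 = "ad published", 0 = "ad rejected" (hypothetical labels).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred)
tp, fp = cm[(1, 1)], cm[(0, 1)]
fn, tn = cm[(1, 0)], cm[(0, 0)]
accuracy = (tp + tn) / len(y_true)  # 3 correct out of 5 -> 0.6
```

In the paper's actual setup, the evaluation step would be applied to the predictions produced by the fine-tuned mBERT and ParsBERT models rather than to toy labels.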
Related papers
- OPSD: an Offensive Persian Social media Dataset and its baseline evaluations [2.356562319390226]
This paper introduces two offensive-language datasets for Persian.
The first dataset comprises annotations provided by domain experts, while the second consists of a large collection of unlabeled data obtained through web crawling.
The obtained F1-scores for the three-class and two-class versions of the dataset were 76.9% and 89.9% for XLM-RoBERTa, respectively.
arXiv Detail & Related papers (2024-04-08T14:08:56Z)
- Multi-dimensional data refining strategy for effective fine-tuning LLMs [2.67766280323297]
This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models.
It describes a multidimensional strategy that includes leveraging existing English-language datasets and developing customized data-crawling scripts with the assistance of generative AI tools.
arXiv Detail & Related papers (2023-11-02T07:50:43Z)
- Studying the impacts of pre-training using ChatGPT-generated text on downstream tasks [0.0]
Our research aims to investigate the influence of artificial text in the pre-training phase of language models.
We conducted a comparative analysis between a language model, RoBERTa, pre-trained using CNN/DailyMail news articles, and ChatGPT, which employed the same articles for its training.
We demonstrate that the utilization of artificial text during pre-training does not have a significant impact on either the performance of the models in downstream tasks or their gender bias.
arXiv Detail & Related papers (2023-09-02T12:56:15Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for underrepresented languages.
We show that using noisy web-crawled data instead of structured data is more suitable for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.