Building domain specific lexicon based on TikTok comment dataset
- URL: http://arxiv.org/abs/2012.08773v1
- Date: Wed, 16 Dec 2020 07:26:43 GMT
- Title: Building domain specific lexicon based on TikTok comment dataset
- Authors: Hao Jiaxiang
- Abstract summary: Previous research focused on sentiment analysis in English, for example, analyzing the sentiment tendency of sentences based on their Valence, Arousal, and Dominance.
This paper presents a method for building a domain-specific lexicon.
The resulting model can classify Chinese words by emotional tendency.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the sentiment analysis task, predicting the sentiment tendency of a
sentence is an important branch. Previous research focused more on sentiment
analysis in English, for example, analyzing the sentiment tendency of sentences
based on their Valence, Arousal, and Dominance. However, emotional tendency
differs between the two languages; for example, the sentence order in Chinese
and English may convey different emotions. This paper presents a method for
building a domain-specific lexicon, with which the model can classify Chinese
words by emotional tendency. In this approach, based on [13], an ultra-dense
space embedding table is trained from word embeddings of Chinese TikTok reviews
and an emotional lexicon source (seed words). The output of the model is a
domain-specific lexicon that gives the emotional tendency of words. I collected
Chinese TikTok comments as training data. Comparing the training results
against the PCA method to evaluate the model's performance on Chinese sentiment
classification shows that the model performs well on Chinese. The source code
has been released on GitHub: https://github.com/h2222/douyin_comment_dataset
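The seed-word projection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy embeddings, the seed words, and the difference-of-means direction (standing in for the learned ultra-dense transform of [13]) are all assumptions made for the example.

```python
# Minimal sketch: induce a one-dimensional sentiment axis from seed words,
# in the spirit of the ultra-dense embedding approach the paper builds on.
# Toy embeddings, seed words, and the difference-of-means direction are
# illustrative assumptions, not the paper's actual data or method.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# A hidden "true" sentiment direction, used only to fabricate toy vectors.
true_axis = rng.normal(size=dim)
true_axis /= np.linalg.norm(true_axis)

def toy_vec(polarity: float) -> np.ndarray:
    """Toy word embedding: polarity along true_axis plus noise."""
    return polarity * true_axis + 0.3 * rng.normal(size=dim)

emb = {w: toy_vec(+1.0) for w in ["好", "棒", "喜欢"]}       # positive words
emb.update({w: toy_vec(-1.0) for w in ["差", "烂", "讨厌"]})  # negative words

pos_seeds, neg_seeds = ["好", "棒"], ["差", "烂"]

# Estimate a dense sentiment direction from the seeds: the difference
# between the mean positive-seed and mean negative-seed vectors.
q = (np.mean([emb[w] for w in pos_seeds], axis=0)
     - np.mean([emb[w] for w in neg_seeds], axis=0))
q /= np.linalg.norm(q)

# Project every vocabulary word onto the axis to build the lexicon:
# positive score -> positive tendency, negative score -> negative tendency.
lexicon = {w: float(emb[w] @ q) for w in emb}
```

Held-out words such as 喜欢 ("like") and 讨厌 ("dislike") then receive scores consistent with their polarity; a PCA baseline, as compared in the paper, could analogously score words along the first principal component of the embedding matrix.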
Related papers
- Lexicon-Based Sentiment Analysis on Text Polarities with Evaluation of Classification Models [1.342834401139078]
This work uses a lexicon-based method to perform sentiment analysis and shows an evaluation of classification models trained over textual data.
The lexicon-based methods identify the intensity of emotion and subjectivity at word levels.
This work is based on a multi-class problem of text being labeled as positive, negative, or neutral.
arXiv Detail & Related papers (2024-09-19T15:31:12Z) - A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There
Outlier Words? [14.816706893177997]
In this paper we compute sentiment for more than 150,000 English language texts drawn from 4 domains.
We model differences in sentiment scores between approaches for documents in each domain using a regression.
Our findings are that the importance of a word depends on the domain and there are no standout lexical entries which systematically cause differences in sentiment scores.
arXiv Detail & Related papers (2023-11-10T18:21:50Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Discourse Representation Structure Parsing for Chinese [8.846860617823005]
We explore the feasibility of Chinese semantic parsing in the absence of labeled data for Chinese meaning representations.
We propose a test suite designed explicitly for Chinese semantic parsing, which provides fine-grained evaluation for parsing performance.
Our experimental results show that the difficulty of Chinese semantic parsing is mainly caused by adverbs.
arXiv Detail & Related papers (2023-06-16T09:47:45Z) - Comparing Biases and the Impact of Multilingual Training across Multiple
Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z) - Sentiment-Aware Word and Sentence Level Pre-training for Sentiment
Analysis [64.70116276295609]
SentiWSP is a Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose several kinds of tokenizers, including: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z) - End-to-End Chinese Parsing Exploiting Lexicons [15.786281545363448]
We propose an end-to-end Chinese parsing model based on character inputs which jointly learns to output word segmentation, part-of-speech tags and dependency structures.
Our parsing model relies on word-char graph attention networks, which can enrich the character inputs with external word knowledge.
arXiv Detail & Related papers (2020-12-08T12:24:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences.