Using Word Embeddings to Analyze Protests News
- URL: http://arxiv.org/abs/2203.05875v1
- Date: Fri, 11 Mar 2022 12:25:59 GMT
- Title: Using Word Embeddings to Analyze Protests News
- Authors: Maria Alejandra Cardoza Ceron
- Abstract summary: Two well-performing models were chosen in order to replace the existing word embeddings, word2vec and FastText, with ELMo and DistilBERT.
Unlike bag-of-words or earlier vector approaches, ELMo and DistilBERT represent words as sequences of vectors, capturing their meaning from the contextual information in the text.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The first two tasks of the CLEF 2019 ProtestNews events focused on
distinguishing between protest and non-protest related news articles and
sentences in a binary classification task. Among the submissions, two
well-performing models were chosen in order to replace the existing word
embeddings, word2vec and FastText, with ELMo and DistilBERT. Unlike bag-of-words
or earlier vector approaches, ELMo and DistilBERT represent words as sequences
of vectors, capturing their meaning from the contextual information in the
text. Without changing the architecture of the original models other than the
word embeddings, the DistilBERT implementation improved performance, as
measured by an F1-score of 0.66, compared to the FastText implementation.
DistilBERT also outperformed ELMo in both tasks and models. Cleaning the
datasets by removing stopwords and lemmatizing the words was shown to make
the models more generalizable across different contexts when training on a
dataset of Indian news articles and evaluating the models on a dataset of
news articles from China.
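As a rough illustration of the pipeline the abstract describes, the sketch below first cleans a sentence (stopword removal and lemmatization) and then extracts contextual DistilBERT vectors that could stand in for FastText features in a downstream classifier. This is a minimal sketch, assuming the Hugging Face `transformers` checkpoint `distilbert-base-uncased` and NLTK for preprocessing; the paper does not name these exact tools, and mean pooling is one common choice, not necessarily the authors'.

```python
# Minimal sketch (not the paper's exact pipeline) of the two steps the
# abstract describes: cleaning text by removing stopwords and lemmatizing,
# then swapping static embeddings for contextual DistilBERT vectors.
# Assumes the `transformers`, `torch`, and `nltk` packages are installed.

import nltk
import torch
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import AutoModel, AutoTokenizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text: str) -> str:
    """Remove stopwords and lemmatize, mirroring the described preprocessing."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Hypothetical example sentence, not from the CLEF 2019 data.
sentence = clean("Thousands of protesters marched through the city centre.")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)token; mean-pool into a sentence vector
# that could feed the original classifier in place of FastText features.
token_vectors = outputs.last_hidden_state.squeeze(0)   # (seq_len, 768)
sentence_vector = token_vectors.mean(dim=0)            # (768,)
print(sentence_vector.shape)  # torch.Size([768])
```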
Related papers
- Analyzing the Generalizability of Deep Contextualized Language Representations For Text Classification
This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT.
In the news classification task, the models are developed on local news from India and tested on local news from China.
In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews.
arXiv Detail & Related papers (2023-03-22T22:31:09Z)
- Text Detoxification using Large Pre-trained Neural Models
We present two novel unsupervised methods for eliminating toxicity in text.
The first method combines guidance of the generation process with small style-conditional language models.
The second method uses BERT to replace toxic words with their non-offensive synonyms.
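As a toy illustration of the second method's general idea (an assumption about the technique, not the authors' code), a masked language model can propose in-context replacements for a masked-out toxic word:

```python
# Hedged sketch: mask a toxic word and let a BERT-style masked language
# model suggest in-context substitutes. The checkpoint and example
# sentence are illustrative assumptions.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "That was a [MASK] idea."  # the toxic word has been masked out
for candidate in fill(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```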
arXiv Detail & Related papers (2021-09-18T11:55:32Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state-of-the-art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
arXiv Detail & Related papers (2021-08-07T06:29:36Z)
- Word2rate: training and evaluating multiple word embeddings as statistical transitions
We introduce a novel left-right context split objective that improves performance for tasks sensitive to word order.
Our Word2rate model is grounded in a statistical foundation using rate matrices while remaining competitive in a variety of language tasks.
arXiv Detail & Related papers (2021-04-16T15:31:29Z)
- Representing ELMo embeddings as two-dimensional text online
We describe a new addition to the WebVectors toolkit, which is used to serve word embedding models over the Web.
The new ELMoViz module adds support for contextualized embedding architectures, in particular for ELMo models.
The provided visualizations follow the metaphor of 'two-dimensional text' by showing lexical substitutes: words which are most semantically similar in context to the words of the input sentence.
arXiv Detail & Related papers (2021-03-30T15:12:29Z)
- Improving Text Generation with Student-Forcing Optimal Transport
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
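A hedged architectural sketch of such a BERT-BiLSTM token tagger follows; the checkpoint, hidden size, and binary label set are illustrative assumptions, not the authors' configuration:

```python
# Sketch of a BERT-BiLSTM token tagger: contextual BERT features feed a
# BiLSTM, and a linear layer scores each token as propaganda or not.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLstmTagger(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased",
                 hidden: int = 256, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, 768) contextual token representations
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(states)
        return self.classifier(lstm_out)  # (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertBiLstmTagger()
batch = tokenizer(["The regime's glorious victory is inevitable."],
                  return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # per-token propaganda scores
```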
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
- Attention Word Embedding
We introduce the Attention Word Embedding (AWE) model, which integrates the attention mechanism into the CBOW model.
We also propose AWE-S, which incorporates subword information.
We demonstrate that AWE and AWE-S outperform state-of-the-art word embedding models on a variety of word similarity datasets.
arXiv Detail & Related papers (2020-06-01T14:47:48Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)