Annotating the Tweebank Corpus on Named Entity Recognition and Building
NLP Models for Social Media Analysis
- URL: http://arxiv.org/abs/2201.07281v1
- Date: Tue, 18 Jan 2022 19:34:23 GMT
- Title: Annotating the Tweebank Corpus on Named Entity Recognition and Building
NLP Models for Social Media Analysis
- Authors: Hang Jiang, Yining Hua, Doug Beeferman, Deb Roy
- Abstract summary: Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature.
We aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and use these datasets to train state-of-the-art NLP models.
We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research.
- Score: 12.871968485402084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media data such as Twitter messages ("tweets") pose a particular
challenge to NLP systems because of their short, noisy, and colloquial nature.
Tasks such as Named Entity Recognition (NER) and syntactic parsing require
highly domain-matched training data for good performance. While there are some
publicly available annotated datasets of tweets, they are all purpose-built for
solving one task at a time. As yet there is no complete training corpus for
both syntactic analysis (e.g., part of speech tagging, dependency parsing) and
NER of tweets. In this study, we aim to create Tweebank-NER, an NER corpus
based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art
NLP models. We first annotate named entities in TB2 using Amazon Mechanical
Turk and measure the quality of our annotations. We train a Stanza NER model on
the new benchmark, achieving competitive performance against other
non-transformer NER systems. Finally, we train other Twitter NLP models (a
tokenizer, lemmatizer, part of speech tagger, and dependency parser) on TB2
based on Stanza, and achieve state-of-the-art or competitive performance on
these tasks. We release the dataset and make the models available to use in an
"off-the-shelf" manner for future Tweet NLP research. Our source code, data,
and pre-trained models are available at:
https://github.com/social-machines/TweebankNLP.
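The released models plug into Stanza's standard pipeline interface. Below is a minimal sketch of such "off-the-shelf" use; the model file paths are illustrative assumptions (the released filenames are documented in the repository above), and a stanza.download("en") call may be needed first to fetch supporting resources such as pretrained word vectors.

```python
import stanza

# Minimal sketch: a Stanza pipeline backed by Tweebank-trained models.
# NOTE: the *_model_path values are illustrative placeholders, not the
# confirmed released filenames; analogous *_model_path options can be
# supplied for the lemmatizer, POS tagger, and dependency parser.
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma,pos,depparse,ner",
    tokenize_model_path="saved_models/tokenize/en_tweet_tokenizer.pt",
    ner_model_path="saved_models/ner/en_tweet_nertagger.pt",
)

doc = nlp("Just landed in San Francisco w/ @united -- SFO is packed rn!")

# Named entities from the Tweebank-NER-trained tagger
for ent in doc.ents:
    print(ent.text, ent.type)

# POS tags and dependency relations from the TB2-trained models
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.head, word.deprel)
```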
Related papers
- NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval [49.827932299460514]
We argue that the capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z)
- Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset [6.633914491587503]
We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
arXiv Detail & Related papers (2023-10-16T06:53:12Z)
- Context-Based Tweet Engagement Prediction [0.0]
This thesis investigates how well context alone may be used to predict tweet engagement likelihood.
We employed the Spark engine on TU Wien's Little Big Data Cluster to create scalable data preprocessing, feature engineering, feature selection, and machine learning pipelines.
We also found that factors such as the prediction algorithm, training dataset size, training dataset sampling method, and feature selection significantly affect the results.
arXiv Detail & Related papers (2023-09-28T08:36:57Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms remains largely untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts [0.0]
We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
arXiv Detail & Related papers (2023-05-26T08:05:01Z)
- A Robust Semantic Frame Parsing Pipeline on a New Complex Twitter Dataset [53.73316523766183]
We introduce a robust semantic frame parsing pipeline that can handle both OOD (out-of-distribution) patterns and OOV (out-of-vocabulary) tokens.
We also build an end-to-end (E2E) application to demonstrate the feasibility of our algorithm and show why it is useful in real applications.
arXiv Detail & Related papers (2022-12-18T01:59:49Z)
- Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts [15.108940488494587]
We focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7.
The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis.
In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data.
arXiv Detail & Related papers (2022-10-07T19:58:47Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT achieves consistent and significant improvements on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.