Annotating the Tweebank Corpus on Named Entity Recognition and Building
NLP Models for Social Media Analysis
- URL: http://arxiv.org/abs/2201.07281v1
- Date: Tue, 18 Jan 2022 19:34:23 GMT
- Title: Annotating the Tweebank Corpus on Named Entity Recognition and Building
NLP Models for Social Media Analysis
- Authors: Hang Jiang, Yining Hua, Doug Beeferman, Deb Roy
- Abstract summary: Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature.
We aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and use these datasets to train state-of-the-art NLP models.
We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research.
- Score: 12.871968485402084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media data such as Twitter messages ("tweets") pose a particular
challenge to NLP systems because of their short, noisy, and colloquial nature.
Tasks such as Named Entity Recognition (NER) and syntactic parsing require
highly domain-matched training data for good performance. While there are some
publicly available annotated datasets of tweets, they are all purpose-built for
solving one task at a time. As yet there is no complete training corpus for
both syntactic analysis (e.g., part of speech tagging, dependency parsing) and
NER of tweets. In this study, we aim to create Tweebank-NER, an NER corpus
based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art
NLP models. We first annotate named entities in TB2 using Amazon Mechanical
Turk and measure the quality of our annotations. We train a Stanza NER model on
the new benchmark, achieving competitive performance against other
non-transformer NER systems. Finally, we train other Twitter NLP models (a
tokenizer, lemmatizer, part of speech tagger, and dependency parser) on TB2
based on Stanza, and achieve state-of-the-art or competitive performance on
these tasks. We release the dataset and make the models available to use in an
"off-the-shelf" manner for future Tweet NLP research. Our source code, data,
and pre-trained models are available at:
https://github.com/social-machines/TweebankNLP.
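The released models plug into Stanza's standard pipeline interface. Below is a minimal sketch of such "off-the-shelf" use; the model file paths are illustrative assumptions (the released filenames are documented in the repository above), and a stanza.download("en") call may be needed first to fetch supporting resources such as pretrained word vectors.

```python
import stanza

# Minimal sketch: a Stanza pipeline backed by Tweebank-trained models.
# NOTE: the *_model_path values are illustrative placeholders, not the
# confirmed released filenames; analogous *_model_path options can be
# supplied for the lemmatizer, POS tagger, and dependency parser.
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma,pos,depparse,ner",
    tokenize_model_path="saved_models/tokenize/en_tweet_tokenizer.pt",
    ner_model_path="saved_models/ner/en_tweet_nertagger.pt",
)

doc = nlp("Just landed in San Francisco w/ @united -- SFO is packed rn!")

# Named entities from the Tweebank-NER-trained tagger
for ent in doc.ents:
    print(ent.text, ent.type)

# POS tags and dependency relations from the TB2-trained models
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos, word.head, word.deprel)
```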
Related papers
- NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval [49.827932299460514]
We argue that the capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z)
- Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset [6.633914491587503]
We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
arXiv Detail & Related papers (2023-10-16T06:53:12Z)
- Context-Based Tweet Engagement Prediction [0.0]
This thesis investigates how well context alone may be used to predict tweet engagement likelihood.
We employed the Spark engine on TU Wien's Little Big Data Cluster to create scalable data preprocessing, feature engineering, feature selection, and machine learning pipelines.
We also found that factors such as the prediction algorithm, training dataset size, training dataset sampling method, and feature selection significantly affect the results.
arXiv Detail & Related papers (2023-09-28T08:36:57Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms remains largely untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts [0.0]
We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
arXiv Detail & Related papers (2023-05-26T08:05:01Z)
- A Robust Semantic Frame Parsing Pipeline on a New Complex Twitter Dataset [53.73316523766183]
We introduce a robust semantic frame parsing pipeline that can handle both OOD (out-of-distribution) patterns and OOV (out-of-vocabulary) tokens.
We also build an end-to-end (E2E) application to demonstrate the feasibility of our algorithm and show why it is useful in real applications.
arXiv Detail & Related papers (2022-12-18T01:59:49Z)
- Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts [15.108940488494587]
We focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7.
The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis.
In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data.
arXiv Detail & Related papers (2022-10-07T19:58:47Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT achieves consistent and significant improvements on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.