Named Entity Recognition in Twitter: A Dataset and Analysis on
Short-Term Temporal Shifts
- URL: http://arxiv.org/abs/2210.03797v1
- Date: Fri, 7 Oct 2022 19:58:47 GMT
- Title: Named Entity Recognition in Twitter: A Dataset and Analysis on
Short-Term Temporal Shifts
- Authors: Asahi Ushio and Leonardo Neves and Vitor Silva and Francesco Barbieri
and Jose Camacho-Collados
- Abstract summary: We focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7.
The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis.
In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to the lack of recently labeled data.
- Score: 15.108940488494587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in language model pre-training has led to important
improvements in Named Entity Recognition (NER). Nonetheless, this progress has
been mainly tested in well-formatted documents such as news, Wikipedia, or
scientific articles. In social media the landscape is different: its noisy and
dynamic nature adds another layer of complexity. In this
paper, we focus on NER in Twitter, one of the largest social media platforms,
and construct a new NER dataset, TweetNER7, which contains seven entity types
annotated over 11,382 tweets from September 2019 to August 2021. The dataset
was constructed by carefully distributing the tweets over time and taking
representative trends as a basis. Along with the dataset, we provide a set of
language model baselines and perform an analysis on the language model
performance on the task, especially analyzing the impact of different time
periods. In particular, we focus on three important temporal aspects in our
analysis: short-term degradation of NER models over time, strategies to
fine-tune a language model over different periods, and self-labeling as an
alternative to the lack of recently labeled data. TweetNER7 is released publicly
(https://huggingface.co/datasets/tner/tweetner7) along with the models
fine-tuned on it (NER models have been integrated into TweetNLP and can be
found at https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper).
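The self-labeling idea from the abstract, using an older model to pseudo-label recent unlabeled tweets and retraining on the union, can be sketched with a deliberately simplified toy tagger. This is a minimal illustration under assumptions, not the paper's actual pipeline: the dictionary-count "model", the confidence rule, and the example tweets are all hypothetical stand-ins for a fine-tuned language model.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Toy NER 'model': per-token tag counts from (tokens, tags) pairs."""
    counts = defaultdict(Counter)
    for tokens, tags in labeled:
        for tok, tag in zip(tokens, tags):
            counts[tok][tag] += 1
    return counts

def predict(model, tokens):
    """Tag one tweet; confidence is the weakest per-token majority ratio.
    Unseen tokens fall back to 'O' with zero confidence."""
    tags, confs = [], []
    for tok in tokens:
        if tok in model:
            tag, n = model[tok].most_common(1)[0]
            tags.append(tag)
            confs.append(n / sum(model[tok].values()))
        else:
            tags.append("O")
            confs.append(0.0)
    return tags, (min(confs) if confs else 0.0)

def self_label(model, unlabeled, threshold=0.9):
    """Keep only tweets whose least confident token clears the threshold."""
    pseudo = []
    for tokens in unlabeled:
        tags, conf = predict(model, tokens)
        if conf >= threshold:
            pseudo.append((tokens, tags))
    return pseudo

# 1) Train on older labeled tweets (the earlier period in the paper's setting).
old_data = [(["Taylor", "sings"], ["B-person", "O"]),
            (["Twitter", "grows"], ["B-corporation", "O"])]
model = train(old_data)

# 2) Pseudo-label recent unlabeled tweets, then retrain on the union.
recent = [["Taylor", "grows"], ["Unknown", "sings"]]
pseudo = self_label(model, recent)   # the low-confidence tweet is discarded
model = train(old_data + pseudo)
```

The confidence threshold is the key knob: too low and noisy pseudo-labels pollute retraining; too high and almost no recent data is added.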
Related papers
- Tweet Insights: A Visualization Platform to Extract Temporal Insights
from Twitter [19.591692602304494]
This paper introduces a large collection of time series data derived from Twitter.
The data spans the past five years and captures changes in n-gram frequency, similarity, sentiment, and topic distribution.
The interface built on top of this data enables temporal analysis for detecting and characterizing shifts in meaning.
arXiv Detail & Related papers (2023-08-04T05:39:26Z)
- Political Sentiment Analysis of Persian Tweets Using CNN-LSTM Model [0.356008609689971]
We present several machine learning models and a deep learning model to analyze the sentiment of Persian political tweets.
Deep learning with ParsBERT embedding performs better than machine learning.
arXiv Detail & Related papers (2023-07-15T08:08:38Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state of the art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis [12.871968485402084]
Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature.
We aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and use it to train state-of-the-art NLP models.
We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research.
arXiv Detail & Related papers (2022-01-18T19:34:23Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Few-NERD: A Few-Shot Named Entity Recognition Dataset [35.669024917327825]
We present Few-NERD, a large-scale human-annotated few-shot NER dataset with a hierarchy of 8 coarse-grained and 66 fine-grained entity types.
Few-NERD consists of 188,238 sentences from Wikipedia containing 4,601,160 words, each annotated as context or as part of a two-level entity type.
arXiv Detail & Related papers (2021-05-16T15:53:17Z)
- RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.