HashSet -- A Dataset For Hashtag Segmentation
- URL: http://arxiv.org/abs/2201.06741v1
- Date: Tue, 18 Jan 2022 04:40:45 GMT
- Title: HashSet -- A Dataset For Hashtag Segmentation
- Authors: Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava,
Ponnurangam Kumaraguru
- Abstract summary: We argue that model performance should be assessed on a wider variety of hashtags.
We propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; b) a 3.3M loosely supervised dataset.
We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset.
- Score: 19.016545782774003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hashtag segmentation is the task of breaking a hashtag into its constituent
tokens. Hashtags often encode the essence of user-generated posts, along with
information like topic and sentiment, which are useful in downstream tasks.
Hashtags prioritize brevity and are written in unique ways -- transliterating
and mixing languages, spelling variations, creative named entities. Benchmark
datasets used for the hashtag segmentation task -- STAN, BOUN -- are small in
size and extracted from a single set of tweets. However, datasets should
reflect the variations in writing styles of hashtags and also account for
domain and language specificity, failing which the results will misrepresent
model performance. We argue that model performance should be assessed on a
wider variety of hashtags, and datasets should be carefully curated. To this
end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated
dataset; b) a 3.3M loosely supervised dataset. The HashSet dataset is sampled from a
different set of tweets when compared to existing datasets and provides an
alternate distribution of hashtags to build and validate hashtag segmentation
models. We show that the performance of SOTA models for Hashtag Segmentation
drops substantially on the proposed dataset, indicating that the proposed dataset
provides an alternate set of hashtags to train and assess models.
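To make the task concrete, the following is a minimal sketch of hashtag segmentation using a dictionary lookup with dynamic programming. It is not the method of the paper or of the SOTA models it evaluates; the function name `segment_hashtag`, the toy vocabulary `TOY_VOCAB`, and the fewest-tokens tie-breaking rule are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def segment_hashtag(hashtag: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split a hashtag into vocabulary words, preferring the fewest tokens.

    Returns None when no full segmentation exists under the given vocabulary.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the fewest-token segmentation of text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"we", "need", "a", "national", "park", "now", "here"}

print(segment_hashtag("#WeNeedANationalPark", TOY_VOCAB))
# ['we', 'need', 'a', 'national', 'park']
```

A fixed vocabulary like this cannot handle the transliterated, code-mixed, or creatively spelled hashtags that the abstract describes, which is one reason a broader evaluation set matters.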
Related papers
- RIGHT: Retrieval-augmented Generation for Mainstream Hashtag Recommendation [76.24205422163169]
We propose RetrIeval-augmented Generative Mainstream HashTag Recommender (RIGHT).
RIGHT consists of three components: 1) a retriever seeks relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances mainstream identification by introducing global signals; and 3) a generator incorporates input tweets and selected hashtags to directly generate the desired hashtags.
Our method achieves significant improvements over state-of-the-art baselines. Moreover, RIGHT can be easily integrated into large language models, improving the performance of ChatGPT by more than 10%.
arXiv Detail & Related papers (2023-12-16T14:47:03Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
- #REVAL: a semantic evaluation framework for hashtag recommendation [6.746400031322727]
We propose a novel semantic evaluation framework for hashtag recommendation, called #REval.
#REval includes an internal module referred to as BERTag, which automatically learns the hashtag embeddings.
Our experiments on three large datasets show that #REval gave more meaningful hashtag synonyms for hashtag recommendation evaluation.
arXiv Detail & Related papers (2023-05-24T07:10:56Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Hashtag-Guided Low-Resource Tweet Classification [31.810562621519804]
We propose a novel Hashtag-guided Tweet Classification model (HashTation).
HashTation automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification.
Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks.
arXiv Detail & Related papers (2023-02-20T18:21:02Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Attend and Select: A Segment Attention based Selection Mechanism for Microblog Hashtag Generation [69.73215951112452]
A hashtag is formed by tokens or phrases that may originate from various fragmentary segments of the original text.
We propose an end-to-end Transformer-based generation model which consists of three phases: encoding, segments-selection, and decoding.
We introduce two large-scale hashtag generation datasets, which are newly collected from Chinese Weibo and English Twitter.
arXiv Detail & Related papers (2021-06-06T15:13:58Z)
- On Identifying Hashtags in Disaster Twitter Data [55.17975121160699]
We construct a unique dataset of disaster-related tweets annotated with hashtags useful for filtering actionable information.
Using this dataset, we investigate Long Short Term Memory-based models within a Multi-Task Learning framework.
The best performing model achieves an F1-score as high as 92.22%.
arXiv Detail & Related papers (2020-01-05T22:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.