HashSet -- A Dataset For Hashtag Segmentation
- URL: http://arxiv.org/abs/2201.06741v1
- Date: Tue, 18 Jan 2022 04:40:45 GMT
- Title: HashSet -- A Dataset For Hashtag Segmentation
- Authors: Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru
- Abstract summary: We argue that model performance should be assessed on a wider variety of hashtags.
We propose HashSet, a dataset comprising (a) a manually annotated set of 1.9k hashtags and (b) a loosely supervised set of 3.3M hashtags.
We show that the performance of SOTA hashtag segmentation models drops substantially on the proposed dataset.
- Score: 19.016545782774003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hashtag segmentation is the task of breaking a hashtag into its constituent
tokens. Hashtags often encode the essence of user-generated posts, along with
information like topic and sentiment, which are useful in downstream tasks.
Hashtags prioritize brevity and are written in unique ways -- transliteration,
language mixing, spelling variations, and creative named entities. Benchmark
datasets used for the hashtag segmentation task -- STAN, BOUN -- are small in
size and extracted from a single set of tweets. However, datasets should
reflect the variations in writing styles of hashtags and also account for
domain and language specificity, failing which the results will misrepresent
model performance. We argue that model performance should be assessed on a
wider variety of hashtags, and datasets should be carefully curated. To this
end, we propose HashSet, a dataset comprising (a) a manually annotated set of
1.9k hashtags and (b) a loosely supervised set of 3.3M hashtags. The HashSet
dataset is sampled from a different set of tweets than existing datasets and
provides an alternate distribution of hashtags for building and validating
hashtag segmentation
models. We show that the performance of SOTA models for hashtag segmentation
drops substantially on the proposed dataset, indicating that it provides an
alternate set of hashtags to train and assess models.
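To make the task concrete, here is a minimal sketch of hashtag segmentation as dictionary-based dynamic programming. This is not a method from the HashSet paper or from the SOTA models it evaluates; it is a common illustrative baseline, and the toy vocabulary, costs, and function name below are assumptions for the example.

```python
# Hypothetical baseline: segment a hashtag into known words by dynamic
# programming over all split points, preferring fewer in-vocabulary tokens.
# Not the paper's method; costs and vocabulary are illustrative.

def segment_hashtag(hashtag: str, vocab: set) -> list:
    """Split a hashtag (with or without '#') into tokens."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] = (cost, segmentation) for the prefix text[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is None:
                continue
            word = text[j:i]
            # in-vocabulary words cost 1; unknown single chars cost 10,
            # so unsegmentable spans fall back to character pieces
            if word in vocab:
                cost = best[j][0] + 1
            elif len(word) == 1:
                cost = best[j][0] + 10
            else:
                continue
            if best[i] is None or cost < best[i][0]:
                best[i] = (cost, best[j][1] + [word])
    return best[n][1] if best[n] else [text]

vocab = {"no", "place", "like", "home"}
print(segment_hashtag("#noplacelikehome", vocab))
# → ['no', 'place', 'like', 'home']
```

Real segmenters replace the flat costs with language-model probabilities, which is what makes transliterated, code-mixed, and creatively spelled hashtags (the cases HashSet targets) hard: they are rarely covered by a standard vocabulary or language model.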
Related papers
- Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets [51.74296438621836]
We introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels.
The main limitation of scribbles as source for weak supervision is the lack of challenging datasets for scribble segmentation.
Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations.
arXiv Detail & Related papers (2024-08-22T15:29:08Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - RIGHT: Retrieval-augmented Generation for Mainstream Hashtag Recommendation [76.24205422163169]
We propose the RetrIeval-augmented Generative Mainstream HashTag Recommender (RIGHT).
RIGHT consists of three components: 1) a retriever seeks relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances mainstream identification by introducing global signals; and 3) a generator incorporates input tweets and selected hashtags to directly generate the desired hashtags.
Our method achieves significant improvements over state-of-the-art baselines. Moreover, RIGHT can be easily integrated into large language models, improving the performance of ChatGPT by more than 10%.
arXiv Detail & Related papers (2023-12-16T14:47:03Z) - #REVAL: a semantic evaluation framework for hashtag recommendation [6.746400031322727]
We propose a novel semantic evaluation framework for hashtag recommendation, called #REval.
#REval includes an internal module referred to as BERTag, which automatically learns the hashtag embeddings.
Our experiments on three large datasets show that #REval gave more meaningful hashtag synonyms for hashtag recommendation evaluation.
arXiv Detail & Related papers (2023-05-24T07:10:56Z) - Hashtag-Guided Low-Resource Tweet Classification [31.810562621519804]
We propose a novel Hashtag-guided Tweet Classification model (HashTation)
HashTation automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification.
Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks.
arXiv Detail & Related papers (2023-02-20T18:21:02Z) - Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
Detection Hub is a new design that is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z) - Attend and Select: A Segment Attention based Selection Mechanism for Microblog Hashtag Generation [69.73215951112452]
A hashtag is formed by tokens or phrases that may originate from various fragmentary segments of the original text.
We propose an end-to-end Transformer-based generation model which consists of three phases: encoding, segments-selection, and decoding.
We introduce two large-scale hashtag generation datasets, which are newly collected from Chinese Weibo and English Twitter.
arXiv Detail & Related papers (2021-06-06T15:13:58Z) - On Identifying Hashtags in Disaster Twitter Data [55.17975121160699]
We construct a unique dataset of disaster-related tweets annotated with hashtags useful for filtering actionable information.
Using this dataset, we investigate Long Short Term Memory-based models within a Multi-Task Learning framework.
The best performing model achieves an F1-score as high as 92.22%.
arXiv Detail & Related papers (2020-01-05T22:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.