Text classification in shipping industry using unsupervised models and
Transformer based supervised models
- URL: http://arxiv.org/abs/2212.12407v1
- Date: Wed, 21 Dec 2022 16:00:44 GMT
- Title: Text classification in shipping industry using unsupervised models and
Transformer based supervised models
- Authors: Ying Xie and Dongping Song
- Abstract summary: We propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry.
Our method stems from representing words using pretrained GloVe word embeddings and finding the most likely label using cosine similarity.
To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content.
- Score: 1.4594704809280983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Obtaining labelled data in a particular context can be expensive and time
consuming. Although different algorithms, including unsupervised learning,
semi-supervised learning, and self-learning, have been adopted, the performance
of text classification varies with context. Given the lack of a labelled
dataset, we propose a novel and simple unsupervised text classification model
to classify cargo content in the international shipping industry using the
Standard International Trade Classification (SITC) codes. Our method stems from
representing words using pretrained GloVe word embeddings and finding the most
likely label using cosine similarity. To compare the unsupervised text
classification model with supervised classification, we also applied several
Transformer models to classify cargo content. Due to the lack of training data,
the SITC numerical codes and their corresponding textual descriptions were used
as training data. A small set of manually labelled cargo content data was used
to evaluate the classification performance of the unsupervised classification
and the Transformer-based supervised classification. The comparison reveals
that unsupervised classification significantly outperforms Transformer-based
supervised classification, even after the size of the training dataset is
increased by 30%. A lack of training data is a key bottleneck that prevents
deep learning models (such as Transformers) from succeeding in practical
applications. Unsupervised classification provides an efficient and effective
alternative for classifying text when training data is scarce.
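The abstract's method can be sketched in a few lines: embed each word, average the word vectors of a cargo description, do the same for each label's textual description, and pick the label with the highest cosine similarity. The sketch below is illustrative, not the authors' code: the tiny inline vectors stand in for real pretrained GloVe embeddings (which are 50- to 300-dimensional), and the two SITC-style codes and their descriptions are assumed examples.

```python
import math

# Placeholder word vectors standing in for pretrained GloVe embeddings.
embeddings = {
    "frozen":  [0.9, 0.1, 0.0],
    "fish":    [0.8, 0.2, 0.1],
    "seafood": [0.7, 0.3, 0.1],
    "steel":   [0.0, 0.9, 0.2],
    "coils":   [0.1, 0.8, 0.3],
    "iron":    [0.1, 0.9, 0.1],
}

def text_vector(text):
    """Mean of the embeddings of in-vocabulary words in the text."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0

# SITC-style codes paired with short textual descriptions (assumed examples).
labels = {"03": "fish seafood", "67": "iron steel"}

def classify(cargo_description):
    """Assign the label whose description vector is most similar to the text."""
    v = text_vector(cargo_description)
    return max(labels, key=lambda code: cosine(v, text_vector(labels[code])))

print(classify("frozen fish"))   # -> 03
print(classify("steel coils"))   # -> 67
```

Because the label descriptions themselves serve as the "training data", this classifier needs no labelled examples at all, which is the point the abstract makes against the Transformer baselines.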
Related papers
- Multidimensional Perceptron for Efficient and Explainable Long Text Classification [31.31206469613901]
We propose a simple but effective model, Segment-aWare multIdimensional PErceptron (SWIPE) to replace attention/RNNs in the framework.
SWIPE can effectively learn the label of the entire text with supervised training, while perceiving the labels of the segments and estimating their contributions to the long-text labelling.
arXiv Detail & Related papers (2023-04-04T08:49:39Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- Label Semantic Aware Pre-training for Few-shot Text Classification [53.80908620663974]
We propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems.
LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains.
arXiv Detail & Related papers (2022-04-14T17:33:34Z)
- How does a Pre-Trained Transformer Integrate Contextual Keywords? Application to Humanitarian Computing [0.0]
This paper describes how to improve a humanitarian classification task by adding the crisis event type to each tweet to be classified.
It shows how the proposed neural network approach is partially over-fitting the particularities of the Crisis Benchmark.
arXiv Detail & Related papers (2021-11-07T11:24:08Z)
- BERT got a Date: Introducing Transformers to Temporal Tagging [4.651578365545765]
We present a transformer encoder-decoder model using the RoBERTa language model as our best performing system.
Our model surpasses previous works in temporal tagging and type classification, especially on rare classes.
arXiv Detail & Related papers (2021-09-30T08:54:21Z)
- Discriminative and Generative Transformer-based Models For Situation Entity Classification [8.029049649310211]
We re-examine the situation entity (SE) classification task with varying amounts of available training data.
We exploit a Transformer-based variational autoencoder to encode sentences into a lower dimensional latent space.
arXiv Detail & Related papers (2021-09-15T17:07:07Z)
- Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification [94.55805516167369]
We propose a new approach for binary classification from $m$ U-sets for $m \ge 2$.
Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC).
arXiv Detail & Related papers (2021-02-01T07:36:38Z)
- TF-CR: Weighting Embeddings for Text Classification [6.531659195805749]
We introduce a novel weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings.
Experiments on 16 classification datasets show the effectiveness of TF-CR, leading to improved performance scores over existing weighting schemes.
arXiv Detail & Related papers (2020-12-11T19:23:28Z)
- Text Classification with Few Examples using Controlled Generalization [58.971750512415134]
Current practice relies on pre-trained word embeddings to map words unseen in training to similar seen ones.
Our alternative begins with sparse pre-trained representations derived from unlabeled parsed corpora.
We show that a feed-forward network over these vectors is especially effective in low-data scenarios.
arXiv Detail & Related papers (2020-05-18T06:04:58Z)
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to equalize the label distributions and computed a supervised loss during the training process.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
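The self-training step described in the last related paper above, keeping low-entropy predictions over unlabeled sentences as pseudo labels, can be sketched as follows. This is a generic illustration of the idea, not the SMDA authors' code; the entropy threshold of 0.3 and the toy predictions are assumptions for the example.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pseudo_label(unlabeled_predictions, threshold=0.3):
    """Keep only low-entropy (i.e. confident) predictions as pseudo labels.

    unlabeled_predictions: list of (sentence, [class probabilities]) pairs.
    Returns (sentence, argmax class index) pairs for confident predictions;
    high-entropy predictions are discarded rather than used for training.
    """
    kept = []
    for sentence, probs in unlabeled_predictions:
        if entropy(probs) < threshold:
            kept.append((sentence, max(range(len(probs)), key=probs.__getitem__)))
    return kept

preds = [
    ("great product", [0.95, 0.05]),   # low entropy  -> pseudo-labeled
    ("not sure here", [0.55, 0.45]),   # high entropy -> discarded
]
print(pseudo_label(preds))   # -> [('great product', 0)]
```

The retained pairs would then be mixed into the labeled set for the next training round, which is the standard self-training loop.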
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.