Zero-shot hashtag segmentation for multilingual sentiment analysis
- URL: http://arxiv.org/abs/2112.03213v1
- Date: Mon, 6 Dec 2021 18:13:46 GMT
- Title: Zero-shot hashtag segmentation for multilingual sentiment analysis
- Authors: Ruan Chaves Rodrigues, Marcelo Akira Inuzuka, Juliana Resplande Sant'Anna Gomes, Acquila Santos Rocha, Iacer Calixto, Hugo Alexandre Dantas do Nascimento
- Abstract summary: Hashtag segmentation, also known as hashtag decomposition, is a common step in preprocessing pipelines for social media datasets.
We develop a zero-shot hashtag segmentation framework and demonstrate how it can be used to improve the accuracy of multilingual sentiment analysis pipelines.
- Score: 1.8762753243053634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hashtag segmentation, also known as hashtag decomposition, is a common step
in preprocessing pipelines for social media datasets. It usually precedes tasks
such as sentiment analysis and hate speech detection. For sentiment analysis in
medium to low-resourced languages, previous research has demonstrated that a
multilingual approach that resorts to machine translation can be competitive or
superior to previous approaches to the task. We develop a zero-shot hashtag
segmentation framework and demonstrate how it can be used to improve the
accuracy of multilingual sentiment analysis pipelines. Our zero-shot framework
establishes a new state-of-the-art for hashtag segmentation datasets,
surpassing even previous approaches that relied on feature engineering and
language models trained on in-domain data.
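For readers unfamiliar with the task, the following is a minimal sketch of what hashtag segmentation does. It is not the zero-shot framework described in the abstract: it swaps in a simple unigram dynamic-programming segmenter. The vocabulary, its counts, the unknown-word penalty, and the function names are all hypothetical and chosen only for illustration.

```python
import math
from typing import List

# Hypothetical toy vocabulary with made-up counts (illustration only).
TOY_COUNTS = {
    "we": 50, "need": 30, "a": 80, "national": 10, "park": 12,
    "climate": 15, "change": 20, "is": 70, "real": 18,
}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word: str) -> float:
    """Unigram log-probability; unknown words get a length-scaled penalty."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Assumed penalty: harsher for longer unknown strings.
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def segment_hashtag(hashtag: str, max_word_len: int = 20) -> List[str]:
    """Split a hashtag into the word sequence with the highest total score."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] = (best score, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment_hashtag("#ClimateChangeIsReal"))
```

A downstream sentiment classifier would then receive the space-joined words in place of the raw hashtag. In this sketch the quality of the split depends entirely on the toy vocabulary, so it only illustrates the shape of the task.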
Related papers
- Evaluating and explaining training strategies for zero-shot cross-lingual news sentiment analysis [8.770572911942635]
We introduce novel evaluation datasets in several less-resourced languages.
We experiment with a range of approaches including the use of machine translation.
We show that language similarity is not in itself sufficient for predicting the success of cross-lingual transfer.
arXiv Detail & Related papers (2024-09-30T07:59:41Z)
- Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings [2.615008111842321]
We introduce an end-to-end scheme for topic segmentation using semantic speech encoders.
We propose a new benchmark for spoken news topic segmentation by utilizing a dataset featuring 1000 hours of publicly available recordings.
Our results demonstrate that while the traditional pipeline approach achieves a state-of-the-art $P_k$ score of 0.2431 for English, our end-to-end model delivers a competitive $P_k$ score of 0.2564.
arXiv Detail & Related papers (2024-09-10T05:24:36Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers sampling $\le$10% of source training data in each new language.
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- A combined approach to the analysis of speech conversations in a contact center domain [2.575030923243061]
We describe an experiment with a speech analytics process for an Italian contact center that deals with call recordings extracted from inbound or outbound flows.
First, we illustrate in detail the development of an in-house speech-to-text solution, based on the Kaldi framework.
Then, we evaluate and compare different approaches to the semantic tagging of call transcripts.
Finally, a decision tree inducer, called J48S, is applied to the problem of tagging.
arXiv Detail & Related papers (2022-03-12T10:03:20Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Sentiment Analysis on Brazilian Portuguese User Reviews [0.0]
This work analyzes the predictive performance of a range of document embedding strategies, assuming the polarity as the system outcome.
This analysis includes five sentiment analysis datasets in Brazilian Portuguese, unified in a single dataset, and a reference partitioning in training, testing, and validation sets, both made publicly available through a digital repository.
arXiv Detail & Related papers (2021-12-10T11:18:26Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made on inputs tokenized by standard and probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Fine-grained Language Identification with Multilingual CapsNet Model [0.0]
There is an explosion of multilingual content generation and consumption.
There is an increasing need for real-time and fine-grained content analysis services.
Current techniques in spoken language detection may fall short on one of these fronts.
arXiv Detail & Related papers (2020-07-12T20:01:22Z)
- Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z)