Optimal Strategies to Perform Multilingual Analysis of Social Content
for a Novel Dataset in the Tourism Domain
- URL: http://arxiv.org/abs/2311.14727v1
- Date: Mon, 20 Nov 2023 13:08:21 GMT
- Title: Optimal Strategies to Perform Multilingual Analysis of Social Content
for a Novel Dataset in the Tourism Domain
- Authors: Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle
Bessagnet, Annig Le Parc Lacayrelle and Philippe Roose
- Abstract summary: We evaluate few-shot, pattern-exploiting and fine-tuning machine learning techniques on large multilingual language models.
We aim to ascertain the quantity of annotated examples required to achieve good performance in 3 common NLP tasks.
This work paves the way for applying NLP to new domain-specific applications.
- Score: 5.848712585343905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rising influence of social media platforms in various domains, including
tourism, has highlighted the growing need for efficient and automated natural
language processing (NLP) approaches to take advantage of this valuable
resource. However, the transformation of multilingual, unstructured, and
informal texts into structured knowledge often poses significant challenges.
In this work, we evaluate and compare few-shot, pattern-exploiting and
fine-tuning machine learning techniques on large multilingual language models
(LLMs) to establish the best strategy to address the lack of annotated data for
3 common NLP tasks in the tourism domain: (1) Sentiment Analysis, (2) Named
Entity Recognition, and (3) Fine-grained Thematic Concept Extraction (linked to
a semantic resource). Furthermore, we aim to ascertain the quantity of
annotated examples required to achieve good performance in those 3 tasks,
addressing a common challenge encountered by NLP researchers in the
construction of domain-specific datasets.
Extensive experimentation on a newly collected and annotated multilingual
(French, English, and Spanish) dataset composed of tourism-related tweets shows
that current few-shot learning techniques allow us to obtain competitive
results for all three tasks with very little annotation data: 5 tweets per
label (15 in total) for Sentiment Analysis, 10% of the tweets for location
detection (around 160), and 13% (approx. 200) of the tweets annotated with
thematic concepts, a highly fine-grained sequence labeling task based on an
inventory of 315 classes.
This comparative analysis, grounded in a novel dataset, paves the way for
applying NLP to new domain-specific applications, reducing the need for manual
annotations and circumventing the complexities of rule-based, ad hoc solutions.
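To make the compared strategies concrete, two minimal sketches follow, assuming the Hugging Face transformers library and a generic multilingual checkpoint (xlm-roberta-base); the checkpoint name, labels, patterns, and example tweets are illustrative assumptions, not the authors' released code or data.

The first sketch mirrors the few-shot fine-tuning setting: a multilingual encoder is fine-tuned for 3-class tweet sentiment with only 5 examples per label (15 in total), the annotation budget the abstract reports as sufficient for Sentiment Analysis.

```python
# A minimal sketch of few-shot fine-tuning, assuming the Hugging Face
# `transformers` and `torch` packages. Checkpoint, labels, and tweets
# are illustrative assumptions, not the authors' code or data.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["negative", "neutral", "positive"]

# 5 annotated tweets per label (15 in total), the annotation budget the
# abstract reports as sufficient for Sentiment Analysis.
few_shot = [
    ("Worst hotel breakfast I've ever had.", 0),
    ("The museum opens at 9am on weekdays.", 1),
    ("Sunset over the old town was unforgettable!", 2),
    # ... 12 more examples, 5 per label ...
]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(10):  # the training set is tiny, so a few epochs suffice
    for text, label in few_shot:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference on an unseen tweet in any of the three dataset languages.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer("¡La playa es preciosa!", return_tensors="pt")).logits
print(LABELS[logits.argmax(-1).item()])
```

The second sketch shows the pattern-exploiting strategy in its simplest cloze form: classification is recast as filling a masked slot, with a verbalizer mapping predicted words back to labels. Full pattern-exploiting training additionally fine-tunes on such patterns; only the cloze-scoring step is shown here.

```python
# A minimal sketch of cloze-style (pattern-exploiting) classification.
# The pattern and the verbalizer below are assumptions for illustration.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

VERBALIZER = {"bad": "negative", "okay": "neutral", "great": "positive"}

def classify(tweet: str) -> str:
    # Wrap the tweet in a pattern whose masked slot verbalizes the label.
    prompt = f"{tweet} All in all, it was {tokenizer.mask_token}."
    batch = tokenizer(prompt, return_tensors="pt")
    mask_pos = (
        batch["input_ids"][0] == tokenizer.mask_token_id
    ).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**batch).logits[0, mask_pos]
    # Score only the verbalizer words and return the best-scoring label.
    scores = {
        word: logits[tokenizer.convert_tokens_to_ids(
            tokenizer.tokenize(" " + word)[0]
        )].item()
        for word in VERBALIZER
    }
    return VERBALIZER[max(scores, key=scores.get)]

print(classify("The hiking trail was muddy and poorly marked."))
```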
Related papers
- Autoencoder-Based Framework to Capture Vocabulary Quality in NLP [2.41710192205034]
We introduce an autoencoder-based framework that uses neural network capacity as a proxy for vocabulary richness, diversity, and complexity.
We validate our approach on two distinct datasets: the DIFrauD dataset, which spans multiple domains of deceptive and fraudulent text, and the Project Gutenberg dataset, representing diverse languages, genres, and historical periods.
arXiv Detail & Related papers (2025-02-28T21:45:28Z)
- USTCCTSU at SemEval-2024 Task 1: Reducing Anisotropy for Cross-lingual Semantic Textual Relatedness Task [17.905282052666333]
Cross-lingual semantic textual relatedness task is an important research task that addresses challenges in cross-lingual communication and text understanding.
It helps establish semantic connections between different languages, crucial for downstream tasks like machine translation, multilingual information retrieval, and cross-lingual text understanding.
With our approach, we achieve the 2nd-best score in Spanish, the 3rd-best in Indonesian, and multiple top-ten results in the competition's track C.
arXiv Detail & Related papers (2024-11-28T08:40:14Z)
- Evaluating and explaining training strategies for zero-shot cross-lingual news sentiment analysis [8.770572911942635]
We introduce novel evaluation datasets in several less-resourced languages.
We experiment with a range of approaches including the use of machine translation.
We show that language similarity is not in itself sufficient for predicting the success of cross-lingual transfer.
arXiv Detail & Related papers (2024-09-30T07:59:41Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- Specifying Genericity through Inclusiveness and Abstractness Continuous Scales [1.024113475677323]
This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases' (NPs) genericity in natural language.
The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks.
arXiv Detail & Related papers (2024-03-22T15:21:07Z)
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Adapting Knowledge for Few-shot Table-to-Text Generation [35.59842534346997]
We propose a novel framework: Adapt-Knowledge-to-Generate (AKG).
AKG adapts unlabeled domain-specific knowledge into the model, which brings at least three benefits.
Our model achieves superior performance in terms of both fluency and accuracy as judged by human and automatic evaluations.
arXiv Detail & Related papers (2023-02-24T05:48:53Z)
- FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue [70.65782786401257]
This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer in open-domain dialogue.
FETA contains two underlying sets of conversations upon which there are 10 and 7 tasks annotated, enabling the study of intra-dataset task transfer.
We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs.
arXiv Detail & Related papers (2022-05-12T17:59:00Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Annotation Curricula to Implicitly Train Non-Expert Annotators [56.67768938052715]
Voluntary studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain.
This can be overwhelming in the beginning, mentally taxing, and induce errors into the resulting annotations.
We propose annotation curricula, a novel approach to implicitly train annotators.
arXiv Detail & Related papers (2021-06-04T09:48:28Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks (a schematic sketch of such few-shot prompting appears after this list).
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
- Learning Cross-Context Entity Representations from Text [9.981223356176496]
We investigate the use of a fill-in-the-blank task to learn context independent representations of entities from text contexts.
We show that large scale training of neural models allows us to learn high quality entity representations.
Our global entity representations encode fine-grained type categories, such as Scottish footballers, and can answer trivia questions.
arXiv Detail & Related papers (2020-01-11T15:30:56Z)
- A Multi-cascaded Model with Data Augmentation for Enhanced Paraphrase Detection in Short Texts [1.6758573326215689]
We present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts.
Our model is both wide and deep and provides greater robustness across clean and noisy short texts.
arXiv Detail & Related papers (2019-12-27T12:10:10Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
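As a complement to the "Language Models are Few-Shot Learners" entry above, the sketch below illustrates what its few-shot setting means in practice: the model is conditioned on a handful of labeled demonstrations in the prompt, with no gradient updates. It builds only the prompt string; the demonstrations are illustrative placeholders, not data from any of the listed papers.

```python
# A minimal sketch of few-shot in-context prompting: k labeled
# demonstrations are concatenated ahead of the query, and the language
# model simply completes the final line; no gradient updates take place.
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    lines = ["Classify the sentiment of each tweet."]
    for text, label in demos:
        lines.append(f"Tweet: {text}\nSentiment: {label}")
    lines.append(f"Tweet: {query}\nSentiment:")  # the model fills this slot
    return "\n\n".join(lines)

# Placeholder demonstrations, one per label.
demos = [
    ("Loved every minute of the boat tour!", "positive"),
    ("The ticket office was closed with no notice.", "negative"),
    ("The castle is a ten-minute walk from the station.", "neutral"),
]
print(build_few_shot_prompt(demos, "Best tapas I have found in Seville."))
# The resulting string would be sent to any autoregressive LM (e.g. a
# GPT-3-class model), and the generated next word is read off as the label.
```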