Optimal Strategies to Perform Multilingual Analysis of Social Content
for a Novel Dataset in the Tourism Domain
- URL: http://arxiv.org/abs/2311.14727v1
- Date: Mon, 20 Nov 2023 13:08:21 GMT
- Title: Optimal Strategies to Perform Multilingual Analysis of Social Content
for a Novel Dataset in the Tourism Domain
- Authors: Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle
Bessagnet, Annig Le Parc Lacayrelle and Philippe Roose
- Abstract summary: We evaluate few-shot, pattern-exploiting and fine-tuning machine learning techniques on large multilingual language models.
We aim to ascertain the quantity of annotated examples required to achieve good performance in 3 common NLP tasks.
This work paves the way for applying NLP to new domain-specific applications.
- Score: 5.848712585343905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rising influence of social media platforms in various domains, including
tourism, has highlighted the growing need for efficient and automated natural
language processing (NLP) approaches to take advantage of this valuable
resource. However, the transformation of multilingual, unstructured, and
informal texts into structured knowledge often poses significant challenges.
In this work, we evaluate and compare few-shot, pattern-exploiting and
fine-tuning machine learning techniques on large multilingual language models
(LLMs) to establish the best strategy to address the lack of annotated data for
3 common NLP tasks in the tourism domain: (1) Sentiment Analysis, (2) Named
Entity Recognition, and (3) Fine-grained Thematic Concept Extraction (linked to
a semantic resource). Furthermore, we aim to ascertain the quantity of
annotated examples required to achieve good performance in those 3 tasks,
addressing a common challenge encountered by NLP researchers in the
construction of domain-specific datasets.
Extensive experimentation on a newly collected and annotated multilingual
(French, English, and Spanish) dataset composed of tourism-related tweets shows
that current few-shot learning techniques allow us to obtain competitive
results for all three tasks with very little annotation data: 5 tweets per
label (15 in total) for Sentiment Analysis, 10% of the tweets for location
detection (around 160), and 13% (approx. 200) of the tweets annotated with
thematic concepts, a highly fine-grained sequence labeling task based on an
inventory of 315 classes.
This comparative analysis, grounded in a novel dataset, paves the way for
applying NLP to new domain-specific applications, reducing the need for manual
annotations and circumventing the complexities of rule-based, ad hoc solutions.
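To make the compared strategies concrete, two minimal sketches follow, assuming the Hugging Face transformers library and a generic multilingual checkpoint (xlm-roberta-base); the checkpoint name, labels, patterns, and example tweets are illustrative assumptions, not the authors' released code or data.

The first sketch mirrors the few-shot fine-tuning setting: a multilingual encoder is fine-tuned for 3-class tweet sentiment with only 5 examples per label (15 in total), the annotation budget the abstract reports as sufficient for Sentiment Analysis.

```python
# A minimal sketch of few-shot fine-tuning, assuming the Hugging Face
# `transformers` and `torch` packages. Checkpoint, labels, and tweets
# are illustrative assumptions, not the authors' code or data.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["negative", "neutral", "positive"]

# 5 annotated tweets per label (15 in total), the annotation budget the
# abstract reports as sufficient for Sentiment Analysis.
few_shot = [
    ("Worst hotel breakfast I've ever had.", 0),
    ("The museum opens at 9am on weekdays.", 1),
    ("Sunset over the old town was unforgettable!", 2),
    # ... 12 more examples, 5 per label ...
]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(10):  # the training set is tiny, so a few epochs suffice
    for text, label in few_shot:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference on an unseen tweet in any of the three dataset languages.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer("¡La playa es preciosa!", return_tensors="pt")).logits
print(LABELS[logits.argmax(-1).item()])
```

The second sketch shows the pattern-exploiting strategy in its simplest cloze form: classification is recast as filling a masked slot, with a verbalizer mapping predicted words back to labels. Full pattern-exploiting training additionally fine-tunes on such patterns; only the cloze-scoring step is shown here.

```python
# A minimal sketch of cloze-style (pattern-exploiting) classification.
# The pattern and the verbalizer below are assumptions for illustration.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

VERBALIZER = {"bad": "negative", "okay": "neutral", "great": "positive"}

def classify(tweet: str) -> str:
    # Wrap the tweet in a pattern whose masked slot verbalizes the label.
    prompt = f"{tweet} All in all, it was {tokenizer.mask_token}."
    batch = tokenizer(prompt, return_tensors="pt")
    mask_pos = (
        batch["input_ids"][0] == tokenizer.mask_token_id
    ).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**batch).logits[0, mask_pos]
    # Score only the verbalizer words and return the best-scoring label.
    scores = {
        word: logits[tokenizer.convert_tokens_to_ids(
            tokenizer.tokenize(" " + word)[0]
        )].item()
        for word in VERBALIZER
    }
    return VERBALIZER[max(scores, key=scores.get)]

print(classify("The hiking trail was muddy and poorly marked."))
```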
Related papers
- Autoencoder-Based Framework to Capture Vocabulary Quality in NLP [2.41710192205034]
We introduce an autoencoder-based framework that uses neural network capacity as a proxy for vocabulary richness, diversity, and complexity.
We validate our approach on two distinct datasets: the DIFrauD dataset, which spans multiple domains of deceptive and fraudulent text, and the Project Gutenberg dataset, representing diverse languages, genres, and historical periods.
arXiv Detail & Related papers (2025-02-28T21:45:28Z)
- USTCCTSU at SemEval-2024 Task 1: Reducing Anisotropy for Cross-lingual Semantic Textual Relatedness Task [17.905282052666333]
Cross-lingual semantic textual relatedness task is an important research task that addresses challenges in cross-lingual communication and text understanding.
It helps establish semantic connections between different languages, crucial for downstream tasks like machine translation, multilingual information retrieval, and cross-lingual text understanding.
With our approach, we achieve the 2nd-best score in Spanish, the 3rd-best in Indonesian, and multiple top-ten results in the competition's track C.
arXiv Detail & Related papers (2024-11-28T08:40:14Z)
- Evaluating and explaining training strategies for zero-shot cross-lingual news sentiment analysis [8.770572911942635]
We introduce novel evaluation datasets in several less-resourced languages.
We experiment with a range of approaches including the use of machine translation.
We show that language similarity is not in itself sufficient for predicting the success of cross-lingual transfer.
arXiv Detail & Related papers (2024-09-30T07:59:41Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- Specifying Genericity through Inclusiveness and Abstractness Continuous Scales [1.024113475677323]
This paper introduces a novel annotation framework for the fine-grained modeling of Noun Phrases' (NPs) genericity in natural language.
The framework is designed to be simple and intuitive, making it accessible to non-expert annotators and suitable for crowd-sourced tasks.
arXiv Detail & Related papers (2024-03-22T15:21:07Z)
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Adapting Knowledge for Few-shot Table-to-Text Generation [35.59842534346997]
We propose a novel framework: Adapt-Knowledge-to-Generate (AKG).
AKG adapts unlabeled domain-specific knowledge into the model, which brings at least three benefits.
Our model achieves superior performance in terms of both fluency and accuracy as judged by human and automatic evaluations.
arXiv Detail & Related papers (2023-02-24T05:48:53Z)
- FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue [70.65782786401257]
This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer in open-domain dialogue.
FETA contains two underlying sets of conversations upon which there are 10 and 7 tasks annotated, enabling the study of intra-dataset task transfer.
We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs.
arXiv Detail & Related papers (2022-05-12T17:59:00Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Annotation Curricula to Implicitly Train Non-Expert Annotators [56.67768938052715]
Voluntary studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain.
This can be overwhelming in the beginning, mentally taxing, and induce errors into the resulting annotations.
We propose annotation curricula, a novel approach to implicitly train annotators.
arXiv Detail & Related papers (2021-06-04T09:48:28Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks (a schematic sketch of such few-shot prompting appears after this list).
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
- Learning Cross-Context Entity Representations from Text [9.981223356176496]
We investigate the use of a fill-in-the-blank task to learn context independent representations of entities from text contexts.
We show that large scale training of neural models allows us to learn high quality entity representations.
Our global entity representations encode fine-grained type categories, such as Scottish footballers, and can answer trivia questions.
arXiv Detail & Related papers (2020-01-11T15:30:56Z)
- A Multi-cascaded Model with Data Augmentation for Enhanced Paraphrase Detection in Short Texts [1.6758573326215689]
We present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts.
Our model is both wide and deep and provides greater robustness across clean and noisy short texts.
arXiv Detail & Related papers (2019-12-27T12:10:10Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
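As a complement to the "Language Models are Few-Shot Learners" entry above, the sketch below illustrates what its few-shot setting means in practice: the model is conditioned on a handful of labeled demonstrations in the prompt, with no gradient updates. It builds only the prompt string; the demonstrations are illustrative placeholders, not data from any of the listed papers.

```python
# A minimal sketch of few-shot in-context prompting: k labeled
# demonstrations are concatenated ahead of the query, and the language
# model simply completes the final line; no gradient updates take place.
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    lines = ["Classify the sentiment of each tweet."]
    for text, label in demos:
        lines.append(f"Tweet: {text}\nSentiment: {label}")
    lines.append(f"Tweet: {query}\nSentiment:")  # the model fills this slot
    return "\n\n".join(lines)

# Placeholder demonstrations, one per label.
demos = [
    ("Loved every minute of the boat tour!", "positive"),
    ("The ticket office was closed with no notice.", "negative"),
    ("The castle is a ten-minute walk from the station.", "neutral"),
]
print(build_few_shot_prompt(demos, "Best tapas I have found in Seville."))
# The resulting string would be sent to any autoregressive LM (e.g. a
# GPT-3-class model), and the generated next word is read off as the label.
```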