Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
- URL: http://arxiv.org/abs/2602.22730v2
- Date: Wed, 04 Mar 2026 09:21:05 GMT
- Title: Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
- Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
- Abstract summary: This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA). We conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions.
- Score: 1.9779500088459443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
Related papers
- BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR [0.06363400715351396]
This work presents a Bangla IR dataset constructed using a BETA-labeling framework. We examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation.
arXiv Detail & Related papers (2026-02-16T06:04:04Z) - Limited Linguistic Diversity in Embodied AI Datasets [6.956496363213419]
We present a systematic dataset audit of several widely used Vision-Language-Action (VLA) datasets. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity.
arXiv Detail & Related papers (2026-01-06T16:06:47Z) - Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding [0.8602553195689511]
This paper introduces a novel approach using constrained decoding with sequence-to-sequence models. It improves cross-lingual performance by 5% on average for the most complex task. We evaluate our approach across seven languages and six ABSA tasks.
arXiv Detail & Related papers (2025-08-14T06:07:53Z) - LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation [0.8602553195689511]
Cross-lingual aspect-based sentiment analysis involves detailed sentiment analysis in a target language. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. We propose a new approach that leverages a large language model to generate high-quality pseudo-labelled data in the target language.
arXiv Detail & Related papers (2025-08-13T05:55:48Z) - Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks [0.7874708385247352]
This paper introduces a novel dataset for aspect-based sentiment analysis (ABSA). It consists of 3.1K manually annotated reviews from the restaurant domain. We provide 24M reviews without annotations suitable for unsupervised learning.
arXiv Detail & Related papers (2025-08-11T16:03:28Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Transformer-based Multi-Aspect Modeling for Multi-Aspect Multi-Sentiment Analysis [56.893393134328996]
We propose a novel Transformer-based Multi-aspect Modeling scheme (TMM), which can capture potential relations between multiple aspects and simultaneously detect the sentiment of all aspects in a sentence.
Our method achieves noticeable improvements compared with strong baselines such as BERT and RoBERTa.
arXiv Detail & Related papers (2020-11-01T11:06:31Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)