Related papers: Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

URL: http://arxiv.org/abs/2602.14812v1
Date: Mon, 16 Feb 2026 15:04:35 GMT
Title: Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Authors: Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Abstract summary: This paper presents BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque. We evaluate model performance across three levels of commonsense understanding. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque.
Score: 10.575017227616124
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.
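
The three evaluation levels named in the abstract can be illustrated with a minimal Python sketch. It assumes a strictly tiered scoring scheme in which an example counts toward consistency only if the plausible narrative was also identified, and toward verifiability only if the conflicting element was also identified; the abstract calls the levels hierarchical but does not spell out the scoring, so this is an assumption. The `Judgment` record, its field names, and the `tiered_scores` function are hypothetical and are not taken from the paper or the dataset.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical record of one gold label or one model prediction."""
    plausible_story: int       # index of the narrative judged plausible
    conflict: frozenset[int]   # sentence indices that conflict in the implausible one
    state: str                 # physical state behind the implausibility


def tiered_scores(gold: list[Judgment], pred: list[Judgment]) -> dict[str, float]:
    """Score predictions at three levels, each requiring the previous one (assumed)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")

    accurate = consistent = verifiable = 0
    for g, p in zip(gold, pred):
        # Level 1: the plausible narrative was picked correctly.
        story_ok = g.plausible_story == p.plausible_story
        # Level 2: level 1 holds and the conflicting element matches.
        conflict_ok = story_ok and g.conflict == p.conflict
        # Level 3: level 2 holds and the physical state matches.
        state_ok = conflict_ok and g.state == p.state

        accurate += story_ok
        consistent += conflict_ok
        verifiable += state_ok

    n = len(gold)
    return {
        "accuracy": accurate / n,
        "consistency": consistent / n,
        "verifiability": verifiable / n,
    }
```

Under this assumed scheme the three scores can only decrease from accuracy to verifiability, which is one way to read the reported result that verifiability is the weakest level.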

Related papers

Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs [50.07451351559251]
We present a study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese). We examine how position bias interacts with prompt strategies and affects output entropy.
arXiv Detail & Related papers (2025-05-22T02:23:00Z)
A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs [3.4020284996081216]
We focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin. First, we explore named entity recognition and machine translation into English. We show that incorporating context via a retrieval-augmented generation approach significantly boosts performance.
arXiv Detail & Related papers (2025-05-19T14:30:10Z)
Can Language Models Learn Typologically Implausible Languages? [62.823015163987996]
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals. We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z)
Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning [9.75748930802634]
We study the power of in-context learning to improve machine learning performance. We consider an extremely simple domain: next token prediction on simple language tasks. We find that LLMs uniformly underperform n-gram models on this task.
arXiv Detail & Related papers (2025-01-06T07:57:51Z)
Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including GPT, Claude, Llama, Mistral, and the Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning [44.53966523376327]
SeaEval is a benchmark for multilingual foundation models. We characterize how these models understand and reason with natural language. We also investigate how well they comprehend cultural practices, nuances, and values.
arXiv Detail & Related papers (2023-09-09T11:42:22Z)
Do language models learn typicality judgments from text? [6.252236971703546]
We evaluate predictive language models (LMs) on a prevalent phenomenon in cognitive science: typicality. Our first test targets whether typicality modulates LMs in assigning taxonomic category memberships to items. The second test investigates sensitivities to typicality in LMs' probabilities when extending new information about items to their categories.
arXiv Detail & Related papers (2021-05-06T21:56:40Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem. Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters. Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.