Related papers: ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

URL: http://arxiv.org/abs/2603.02676v1
Date: Tue, 03 Mar 2026 07:02:45 GMT
Title: ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Authors: Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya,
Abstract summary: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts.<n>We introduce a novel method that reduces these biases through explicit structural abstraction.<n>Our approach achieves top-5 rankings across all subtasks while substantially reducing content effects.
Score: 9.363838558599863
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.

Related papers

What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning [0.03499870393443267]
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs.<n>We apply two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model.<n>Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence.
arXiv Detail & Related papers (2025-11-19T21:23:58Z)
On the Entity-Level Alignment in Crosslingual Consistency [62.33186691736433]
SubSub and SubInj integrate English translations of subjects into prompts across languages, leading to substantial gains in factual recall accuracy and consistency.<n>These interventions reinforce the entity representation alignment in the conceptual space through model's internal pivot-language processing.
arXiv Detail & Related papers (2025-10-11T16:26:50Z)
Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models [64.54005959758733]
We introduce code-switching in-context learning (CSICL) as a principled and robust approach for overcoming the translation barrier during inference.<n>We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains.<n>Our results demonstrate CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target and unseen languages.
arXiv Detail & Related papers (2025-10-07T08:35:42Z)
Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis [54.53152524778821]
integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence.<n>We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift.<n>We investigate two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA)<n> Experimental results show that both approaches better maintain textual competence than full fine-tuning, while also improving downstream spoken question answering performance.
arXiv Detail & Related papers (2025-09-28T09:04:40Z)
From Language to Logic: A Bi-Level Framework for Structured Reasoning [6.075080928704587]
Structured reasoning over natural language inputs remains a core challenge in artificial intelligence.<n>We propose a novel framework that maps language to logic through a two-stage process: high-level task abstraction and low-level logic generation.<n>Our approach significantly outperforms existing baselines in accuracy, with accuracy gains reaching as high as 40%.
arXiv Detail & Related papers (2025-07-11T11:24:09Z)
Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models [25.283401945003277]
We evaluate the zero-shot capabilities of large language models (LLMs) in classifying framing roles.<n>Our findings show that a hierarchical approach of first identifying broad roles and then fine-grained roles, outperforms single-step classification.
arXiv Detail & Related papers (2025-04-29T07:10:53Z)
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks [71.19560970717495]
Recent language models show impressive performance across a wide range of tasks. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? We propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks.
arXiv Detail & Related papers (2023-07-05T17:50:42Z)
UCAS-IIE-NLP at SemEval-2023 Task 12: Enhancing Generalization of Multilingual BERT for Low-resource Sentiment Analysis [24.542445315345464]
This paper describes our system designed for SemEval-2023 Task 12: Sentiment analysis for African languages. Specifically, we design a lexicon-based multilingual BERT to facilitate language adaptation and sentiment-aware representation learning. Our system achieved competitive results, largely outperforming baselines on both multilingual and zero-shot sentiment classification subtasks.
arXiv Detail & Related papers (2023-06-01T19:10:09Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z)
PALI-NLP at SemEval-2022 Task 4: Discriminative Fine-tuning of Deep Transformers for Patronizing and Condescending Language Detection [4.883341580669763]
We propose a novel Transformer-based model and its ensembles to accurately understand such language context for PCL detection. To facilitate comprehension of the subtle and subjective nature of PCL, two fine-tuning strategies are applied. The system achieves remarkable results on the official ranking, namely 1st in Subtask 1 and 5th in Subtask 2.
arXiv Detail & Related papers (2022-03-09T10:05:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.