What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning
- URL: http://arxiv.org/abs/2511.15886v1
- Date: Wed, 19 Nov 2025 21:23:58 GMT
- Title: What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning
- Authors: Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani,
- Abstract summary: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. We apply two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model. Our experimental results highlight key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
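For readers who want to probe similar attribution questions, the sketch below shows how the two tools named in the abstract are typically invoked. It is a minimal illustration, not the authors' exact pipeline: the word problem is a placeholder standing in for an MGSM item, and it assumes the public Python interfaces of the `inseq` and `context_cite` packages (the attribution method and `top_k` value are arbitrary choices).

```python
# Minimal sketch: token-level attribution with Inseq and step-level
# attribution with ContextCite, assuming the public APIs of the
# `inseq` and `context_cite` packages. The question below is a
# placeholder, not an actual MGSM item.
import inseq
from context_cite import ContextCiter

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
QUESTION = "A farmer has 12 eggs and sells 5. How many eggs are left?"

# --- Token-level attribution (Inseq) ---
# Wrap the model with a gradient-based attribution method, then
# attribute the generated answer back to the prompt tokens.
inseq_model = inseq.load_model(MODEL, "saliency")
result = inseq_model.attribute(
    input_texts=QUESTION + " Let's think step by step."
)
result.show()  # renders a token-level attribution heatmap

# --- Step-level attribution (ContextCite) ---
# Score how much each span of the context contributes to the
# model's generated response.
cc = ContextCiter.from_pretrained(
    MODEL,
    context=QUESTION,
    query="How many eggs are left?",
)
print(cc.response)                                      # generated answer
print(cc.get_attributions(as_dataframe=True, top_k=3))  # top contributing spans
```

In the paper's setup, the context would be a multilingual MGSM question together with the generated reasoning chain, so that each CoT step can be scored as a source; here both roles are collapsed into one toy prompt for brevity.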
Related papers
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z) - On the Entity-Level Alignment in Crosslingual Consistency [62.33186691736433]
SubSub and SubInj integrate English translations of subjects into prompts across languages, leading to substantial gains in factual recall accuracy and consistency. These interventions reinforce entity representation alignment in the conceptual space through the model's internal pivot-language processing.
arXiv Detail & Related papers (2025-10-11T16:26:50Z) - Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning [85.7304930030649]
We propose M-Thinker, which is trained with a Language Consistency reward and a Cross-lingual Thinking Alignment reward. M-Thinker achieves nearly 100% language consistency and superior performance on two multilingual benchmarks.
arXiv Detail & Related papers (2025-10-08T17:55:02Z) - Robust Native Language Identification through Agentic Decomposition [23.899157231471104]
Large language models (LLMs) often achieve high performance on native language identification (NLI) benchmarks by leveraging superficial contextual clues. We show that such a strategy is unreliable and that model predictions can be easily altered by misleading hints. We introduce an agentic NLI pipeline inspired by forensic linguistics, where specialized agents accumulate and categorize diverse linguistic evidence.
arXiv Detail & Related papers (2025-09-20T12:38:03Z) - Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - How does a Language-Specific Tokenizer affect LLMs? [0.36248657646376703]
The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained on English text data.
arXiv Detail & Related papers (2025-02-18T05:54:56Z) - Markovian Transformers for Informative Language Modeling [1.172865818448696]
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model's underlying decision process. We introduce a Markovian language model framework that can be understood as a reasoning autoencoder.
arXiv Detail & Related papers (2024-04-29T17:36:58Z) - Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents [80.5213198675411]
Large language models (LLMs) have dramatically enhanced the field of language intelligence.
LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer.
Recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents.
arXiv Detail & Related papers (2023-11-20T14:30:55Z) - Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions [10.621564997491808]
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models.
We investigate whether CoT prompting affects the relative importance these models assign to particular input tokens.
Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt, it increases the robustness of saliency scores to question perturbations and variations in model output.
arXiv Detail & Related papers (2023-07-25T08:51:30Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)