Related papers: Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

URL: http://arxiv.org/abs/2512.24842v1
Date: Wed, 31 Dec 2025 13:03:34 GMT
Title: Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
Authors: Yanan Long
Abstract summary: We argue that mechanistic explanations for such models should satisfy a causal standard. Claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across model families, language pairs, and tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
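
The acceptance rule described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `EnvironmentEffects` and `triangulate`, the sign convention for the effects, and the two thresholds are all assumptions introduced here, and the numbers in the usage lines are made up purely for illustration. The sketch assumes each environment in a reference family has already been measured, and accepts a candidate circuit only if the ablation effect (necessity) and the patching effect (sufficiency) are both positive and large enough in every environment (invariance).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EnvironmentEffects:
    """Hypothetical record of intervention effects in one environment."""
    name: str
    # Assumed convention: positive means ablating the circuit degraded the behavior.
    ablation_effect: float
    # Assumed convention: positive means patching activations transferred the behavior.
    patching_effect: float


def triangulate(
    effects: Sequence[EnvironmentEffects],
    min_ablation: float = 0.1,   # illustrative threshold, not from the paper
    min_patching: float = 0.1,   # illustrative threshold, not from the paper
) -> bool:
    """Accept a circuit only if both effects pass in every environment.

    Because both thresholds are positive, passing them also enforces that
    the effects point in the same direction across the reference family.
    """
    if not effects:
        return False  # no evidence, no acceptance
    return all(
        e.ablation_effect >= min_ablation and e.patching_effect >= min_patching
        for e in effects
    )


# Made-up illustrative measurements for a three-environment reference family.
stable = [
    EnvironmentEffects("en", 0.42, 0.35),
    EnvironmentEffects("de", 0.38, 0.31),
    EnvironmentEffects("hi-translit", 0.29, 0.27),
]
# Same circuit, but the ablation effect flips sign in one environment.
unstable = [stable[0], EnvironmentEffects("de", -0.05, 0.30)]

print(triangulate(stable))
print(triangulate(unstable))
print(triangulate(stable[:1]))
```

The last call shows the failure mode the abstract targets: a circuit judged on a single environment can pass, while the same kind of check over a full reference family rejects a candidate whose effect is not directionally stable.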

Related papers

Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages [0.22009842278462158]
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.
arXiv Detail & Related papers (2026-02-02T16:27:32Z)
DVD: A Robust Method for Detecting Variant Contamination in Large Language Model Evaluation [24.086354908256293]
DVD is a single-sample detector that models the local output distribution induced by temperature sampling. We construct the first benchmark for variant contamination across two domains, Omni-MATH and SuperGPQA. DVD consistently outperforms perplexity-based, Min-$k$%++, edit-distance (CDD), and embedding-similarity baselines.
arXiv Detail & Related papers (2026-01-08T12:48:40Z)
Conditions for Catastrophic Forgetting in Multilingual Translation [24.10629800866219]
We identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. We show that the relative scale between model and data size is a primary determinant of forgetting. We also show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
arXiv Detail & Related papers (2025-10-22T12:54:00Z)
Conformal Linguistic Calibration: Trading-off between Factuality and Specificity [41.45862052156885]
We present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy.
arXiv Detail & Related papers (2025-02-26T13:01:49Z)
On the Efficacy of Sampling Adapters [82.5941326570812]
We propose a unified framework for understanding sampling adapters. We argue that the shift they enforce can be viewed as a trade-off between precision and recall. We find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution.
arXiv Detail & Related papers (2023-07-07T17:59:12Z)
Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets with rigorous, statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
arXiv Detail & Related papers (2023-06-16T21:55:08Z)
Zero and Few-shot Semantic Parsing with Ambiguous Inputs [45.285508941560295]
We introduce AmP, a framework, dataset, and challenge for translating ambiguous natural language to formal representations like logic and code. Using AmP, we investigate how several few-shot text-to-code systems handle ambiguity, introducing three new metrics. We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction.
arXiv Detail & Related papers (2023-06-01T15:46:36Z)
Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation [48.32604585839687]
Previous adversarial approaches have shown promising results in inducing cross-lingual word embedding without parallel data. We propose to make use of a sequence of intermediate spaces for smooth bridging.
arXiv Detail & Related papers (2022-10-07T04:37:47Z)
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings. We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
arXiv Detail & Related papers (2022-05-31T17:58:49Z)
Robust Textual Embedding against Word-level Adversarial Attacks [15.235449552083043]
We propose a novel robust training method, termed Fast Triplet Metric Learning (FTML). We show that FTML can significantly improve model robustness against various advanced adversarial attacks. Our work shows the great potential of improving textual robustness through robust word embedding.
arXiv Detail & Related papers (2022-02-28T14:25:00Z)
Locally Typical Sampling [84.62530743899025]
We show that today's probabilistic language generators fall short when it comes to producing coherent and fluent text. We propose a simple and efficient procedure for enforcing this criterion when generating from probabilistic models.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data. In this paper, we propose variable-length textual adversarial attacks (VL-Attack). Our method can achieve a BLEU score of 33.18 on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.