TRACE: Training and Inference-Time Interpretability Analysis for Language Models
- URL: http://arxiv.org/abs/2507.03668v1
- Date: Fri, 04 Jul 2025 15:42:51 GMT
- Title: TRACE: Training and Inference-Time Interpretability Analysis for Language Models
- Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
- Abstract summary: We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics.
- Score: 10.777646083061395
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
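The abstract describes the toolkit's capabilities rather than its API, so the following is only a minimal sketch of what lightweight, in-training, layer-wise diagnostics of this kind can look like in PyTorch: forward hooks capture per-layer hidden states, and a PCA-based proxy tracks intrinsic dimensionality. The names here (`LayerDiagnostics`, `intrinsic_dim_pca`) and the toy model are assumptions for illustration, not TRACE's interface.

```python
# Hypothetical sketch of in-training, layer-wise diagnostics in the spirit
# of TRACE; the class and function names are illustrative, not TRACE's API.
import torch
import torch.nn as nn

def intrinsic_dim_pca(h: torch.Tensor, var_threshold: float = 0.95) -> int:
    """Crude intrinsic-dimensionality proxy: number of principal components
    needed to explain `var_threshold` of the activation variance."""
    h = h.reshape(-1, h.shape[-1]).float()
    h = h - h.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(h)            # singular values of centred acts
    var = s**2 / (s**2).sum()
    return int((torch.cumsum(var, 0) < var_threshold).sum().item()) + 1

class LayerDiagnostics:
    """Registers forward hooks that record per-layer hidden states."""
    def __init__(self, model: nn.Module, layer_type=nn.TransformerEncoderLayer):
        self.acts = {}
        for name, mod in model.named_modules():
            if isinstance(mod, layer_type):
                mod.register_forward_hook(self._save(name))

    def _save(self, name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            self.acts[name] = out.detach()
        return hook

    def report(self):
        return {name: intrinsic_dim_pca(h) for name, h in self.acts.items()}

# Toy usage: one forward pass through a small encoder, then a per-layer report.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=3,
)
diag = LayerDiagnostics(model)
_ = model(torch.randn(8, 16, 64))   # (batch, seq, d_model)
print(diag.report())                # e.g. {'layers.0': 23, 'layers.1': ...}
```

Logging such per-layer numbers at every checkpoint is what makes the convergence-based early stopping mentioned in the abstract actionable.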
Related papers
- Towards Worst-Case Guarantees with Scale-Aware Interpretability [58.519943565092724]
Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda -- scale-aware interpretability -- to develop formal machinery and interpretability tools.
arXiv Detail & Related papers (2026-02-05T01:22:31Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- DeformAr: Rethinking NER Evaluation through Component Analysis and Visual Analytics [0.0]
We introduce DeformAr, a framework to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.
arXiv Detail & Related papers (2025-11-30T15:39:28Z)
- Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation [0.0]
Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations. We present a mathematically grounded framework to understand, measure, and mitigate these hallucinations.
arXiv Detail & Related papers (2025-11-19T00:58:36Z)
- Priors in Time: Missing Inductive Biases for Language Model Interpretability [58.07412640266836]
We show that Sparse Autoencoders impose priors that assume independence of concepts across time, implying stationarity. We introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
arXiv Detail & Related papers (2025-11-03T18:43:48Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate statements that meet both syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining [33.22703101533052]
Large language models learn non-trivial abstractions during pretraining. We use sparse crosscoders to discover and align features across model checkpoints. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining.
arXiv Detail & Related papers (2025-09-05T17:56:24Z)
- Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders [55.749883010057545]
We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics.
arXiv Detail & Related papers (2025-08-04T10:02:19Z)
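The interference matrix described in the entry above is, structurally, a square array indexed by (evaluation language, training language). A minimal sketch of that bookkeeping follows, with a placeholder `score` function standing in for the paper's actual pipeline of training and evaluating a small BERT-like model per pair; the language list is an arbitrary assumption.

```python
# Sketch of the interference-matrix bookkeeping; `score` is a stand-in for
# training and evaluating a small BERT-like model on each language pair.
import numpy as np

LANGS = ["en", "de", "ar", "zh"]    # illustrative subset, not the paper's set

def score(eval_lang: str, train_lang: str) -> float:
    """Placeholder: return evaluation quality on eval_lang after training
    that also includes train_lang's data."""
    rng = np.random.default_rng(abs(hash((eval_lang, train_lang))) % 2**32)
    return float(rng.uniform(0.5, 1.0))

n = len(LANGS)
interference = np.empty((n, n))
for i, ev in enumerate(LANGS):
    for j, tr in enumerate(LANGS):
        interference[i, j] = score(ev, tr)

# The asymmetry finding corresponds to interference differing from its
# transpose: language j can hurt i more than i hurts j.
asymmetry = np.abs(interference - interference.T)
print(np.round(asymmetry, 3))
```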
- Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations [33.04242471060053]
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. No comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance.
arXiv Detail & Related papers (2025-06-20T01:46:50Z)
- TRACE for Tracking the Emergence of Semantic Representations in Transformers [10.777646083061395]
We introduce TRACE, a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. Experiments reveal that phase transitions align with clear intersections between curvature collapse and dimension stabilisation; these geometric shifts coincide with emerging syntactic and semantic accuracy. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation.
arXiv Detail & Related papers (2025-05-23T15:03:51Z)
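Of the geometric signals named in the entry above, curvature is the least routine to measure. A common proxy is the top Hessian eigenvalue of the loss, which can be estimated without materialising the Hessian via power iteration on Hessian-vector products. The sketch below is that generic estimator under an assumed toy model, not the paper's implementation.

```python
# Generic top-Hessian-eigenvalue estimator (a curvature proxy) via power
# iteration on Hessian-vector products; model and loss are toy assumptions.
import torch

def top_hessian_eigenvalue(loss, params, iters: int = 20) -> float:
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        # Hessian-vector product: differentiate the gradients against v.
        hv = torch.autograd.grad(grads, params, grad_outputs=v,
                                 retain_graph=True)
        norm = torch.sqrt(sum((h**2).sum() for h in hv))
        v = [h / norm for h in hv]              # re-normalised iterate
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    # Rayleigh quotient v^T H v with ||v|| = 1 approximates the top eigenvalue.
    return float(sum((a * b).sum() for a, b in zip(hv, v)))

# Toy usage: curvature of a small regression loss.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```

"Curvature collapse" then corresponds to this quantity falling sharply across training steps.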
- Temporal-Spectral-Spatial Unified Remote Sensing Dense Prediction [62.376936772702905]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings.
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
- Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. This paper introduces the STROT Framework, a method for structured prompting and feedback-driven transformation logic generation.
arXiv Detail & Related papers (2025-05-03T00:05:01Z)
- LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [94.84458417662404]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors. LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z)
- Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT [0.0]
This study investigates the internal representations of verb-particle combinations within large language models (LLMs). We analyse the representational efficacy of BERT's layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories.
arXiv Detail & Related papers (2024-12-19T09:21:39Z)
- Interpreting token compositionality in LLMs: A robustness analysis [10.777646083061395]
Constituent-Aware Pooling (CAP) is a methodology designed to analyse how large language models process linguistic structures. CAP intervenes in model activations through constituent-based pooling at various model levels. Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability.
arXiv Detail & Related papers (2024-10-16T18:10:50Z)
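As a concrete picture of the constituent-based pooling named in the entry above, the sketch below mean-pools token activations over constituent spans. The spans, shapes, and function name are hypothetical, and CAP's interventions at multiple model levels are not reproduced.

```python
# Illustrative constituent-based pooling: token vectors inside each
# constituent span are averaged into a single vector. Spans are assumed
# to come from a parser; here they are hard-coded for a toy example.
import torch

def pool_constituents(hidden: torch.Tensor, spans) -> torch.Tensor:
    """hidden: (seq_len, d); spans: list of [start, end) token ranges."""
    return torch.stack([hidden[s:e].mean(dim=0) for s, e in spans])

hidden = torch.randn(6, 16)          # activations for "the cat sat on the mat"
spans = [(0, 2), (2, 3), (3, 6)]     # NP "the cat" | V "sat" | PP "on the mat"
pooled = pool_constituents(hidden, spans)
print(pooled.shape)                  # torch.Size([3, 16])
```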
- Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI make it possible to mitigate the limitations of earlier approaches by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
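BiLRP propagates second-order relevance through a network's layers; the toy sketch below shows only the shallow case that motivates the question, where a sum-pooled embedding similarity decomposes exactly into token-pair contributions. The embedding table and token ids are made-up stand-ins, and no deep-layer relevance propagation is performed.

```python
# Toy token-pair decomposition of a bilinear similarity score; for a
# sum-pooled linear embedding model this decomposition is exact. BiLRP's
# contribution (not shown) is extending it through deep Transformer layers.
import torch

emb = torch.nn.Embedding(1000, 32)   # made-up vocabulary and dimension
s1 = torch.tensor([12, 7, 301])      # token ids of sentence 1 (arbitrary)
s2 = torch.tensor([12, 55, 9, 301])  # token ids of sentence 2 (arbitrary)

e1, e2 = emb(s1), emb(s2)            # (len, d) token embeddings
similarity = e1.sum(0) @ e2.sum(0)   # dot product of pooled sentence vectors

# contributions[i, j] = how much token i of s1 interacting with token j of
# s2 adds to the score; the entries sum back to `similarity` exactly.
contributions = e1 @ e2.T
assert torch.allclose(contributions.sum(), similarity, atol=1e-4)
print(contributions)
```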
- Injecting linguistic knowledge into BERT for Dialogue State Tracking [60.42231674887294]
This paper proposes a method that extracts linguistic knowledge via an unsupervised framework.
We then utilize this knowledge to augment BERT's performance and interpretability in Dialogue State Tracking (DST) tasks.
We benchmark this framework on various DST tasks and observe a notable improvement in accuracy.
arXiv Detail & Related papers (2023-11-27T08:38:42Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content and is not responsible for any consequences of its use.