Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
- URL: http://arxiv.org/abs/2406.04028v1
- Date: Thu, 6 Jun 2024 12:57:31 GMT
- Title: Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
- Authors: Yoann Poupart
- Abstract summary: We propose contrastive sparse autoencoders (CSAE) for studying pairs of game trajectories.
Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI has brought chess systems to a superhuman level, yet these systems heavily rely on black-box algorithms. This is untenable for ensuring transparency to the end user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet, these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. In this respect, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focus on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.
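To make the abstract's idea concrete, here is a minimal sketch of what a contrastive sparse autoencoder objective over a pair of trajectory activations could look like. All names, dimensions, and loss weights below are illustrative assumptions, not the paper's actual implementation: a standard sparse autoencoder (ReLU encoder, L1 sparsity penalty, linear decoder) is augmented with a contrastive term that rewards features differing between the two paired trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_model is the agent's hidden-state size,
# d_feat an overcomplete feature dictionary (assumed, not from the paper).
d_model, d_feat = 64, 256

W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))

def encode(h):
    # ReLU encoding yields sparse, non-negative feature activations.
    return np.maximum(h @ W_enc + b_enc, 0.0)

def decode(f):
    # Linear decoder reconstructs the hidden state from features.
    return f @ W_dec

def csae_loss(h_a, h_b, l1=1e-3, contrast=1e-2):
    """Illustrative loss for one pair of trajectory activations (h_a, h_b).

    Reconstruction + L1 sparsity for each trajectory, minus a contrastive
    term that encourages features to discriminate between the pair.
    The exact formulation in the paper may differ.
    """
    f_a, f_b = encode(h_a), encode(h_b)
    recon = np.mean((decode(f_a) - h_a) ** 2) + np.mean((decode(f_b) - h_b) ** 2)
    sparsity = l1 * (np.abs(f_a).sum() + np.abs(f_b).sum())
    contrastive = -contrast * np.abs(f_a - f_b).sum()
    return recon + sparsity + contrastive

# Toy pair of hidden states standing in for two game trajectories.
h_a = rng.normal(size=d_model)
h_b = rng.normal(size=d_model)
print(csae_loss(h_a, h_b))
```

In this reading, features whose activations differ strongly within a pair are the candidates for plan-relevant concepts that the paper's qualitative analysis and feature taxonomy would then interpret.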
Related papers
- Interpretable end-to-end Neurosymbolic Reinforcement Learning agents [20.034972354302788]
This work places itself within the neurosymbolic AI paradigm, blending the strengths of neural networks with symbolic AI.
We present the first implementation of an end-to-end trained SCoBot and separately evaluate its components on different Atari games.
arXiv Detail & Related papers (2024-10-18T10:59:13Z) - Perturbation on Feature Coalition: Towards Interpretable Deep Neural Networks [0.1398098625978622]
The "black box" nature of deep neural networks (DNNs) compromises their transparency and reliability.
We introduce a perturbation-based interpretation guided by feature coalitions, which leverages deep information of the network to extract correlated features.
arXiv Detail & Related papers (2024-08-23T22:44:21Z) - Visual Agents as Fast and Slow Thinkers [88.6691504568041]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents.
FaST employs a switch adapter to dynamically select between System 1/2 modes.
It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
arXiv Detail & Related papers (2024-08-16T17:44:02Z) - Feature CAM: Interpretable AI in Image Classification [2.4409988934338767]
There is a lack of trust in using Artificial Intelligence in critical, high-precision fields such as security, finance, health, and manufacturing.
We introduce a novel technique Feature CAM, which falls in the perturbation-activation combination, to create fine-grained, class-discriminative visualizations.
The resulting saliency maps proved to be 3-4 times more human-interpretable than the state-of-the-art in ABM.
arXiv Detail & Related papers (2024-03-08T20:16:00Z) - Mathematical Algorithm Design for Deep Learning under Societal and Judicial Constraints: The Algorithmic Transparency Requirement [65.26723285209853]
We derive a framework to analyze whether a transparent implementation in a computing model is feasible.
Based on previous results, we find that Blum-Shub-Smale Machines have the potential to establish trustworthy solvers for inverse problems.
arXiv Detail & Related papers (2024-01-18T15:32:38Z) - AS-XAI: Self-supervised Automatic Semantic Interpretation for CNN [5.42467030980398]
We propose a self-supervised automatic semantic interpretable artificial intelligence (AS-XAI) framework.
It utilizes transparent embedding semantic extraction spaces and row-centered principal component analysis (PCA) for global semantic interpretation of model decisions.
The proposed approach offers broad fine-grained practical applications, including shared semantic interpretation under out-of-distribution categories.
arXiv Detail & Related papers (2023-12-02T10:06:54Z) - Representation Engineering: A Top-Down Approach to AI Transparency [132.0398250233924]
We identify and characterize the emerging area of representation engineering (RepE).
RepE places population-level representations, rather than neurons or circuits, at the center of analysis.
We showcase how these methods can provide traction on a wide range of safety-relevant problems.
arXiv Detail & Related papers (2023-10-02T17:59:07Z) - ShadowNet for Data-Centric Quantum System Learning [188.683909185536]
We propose a data-centric learning paradigm combining the strength of neural-network protocols and classical shadows.
Capitalizing on the generalization power of neural networks, this paradigm can be trained offline and excel at predicting previously unseen systems.
We present the instantiation of our paradigm in quantum state tomography and direct fidelity estimation tasks and conduct numerical analysis up to 60 qubits.
arXiv Detail & Related papers (2023-08-22T09:11:53Z) - Interpretable Self-Aware Neural Networks for Robust Trajectory Prediction [50.79827516897913]
We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among semantic concepts.
We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-16T06:28:20Z) - Interpretable part-whole hierarchies and conceptual-semantic relationships in neural networks [4.153804257347222]
We present Agglomerator, a framework capable of providing a representation of part-whole hierarchies from visual cues.
We evaluate our method on common datasets, such as SmallNORB, MNIST, FashionMNIST, CIFAR-10, and CIFAR-100.
arXiv Detail & Related papers (2022-03-07T10:56:13Z) - Bias in Multimodal AI: Testbed for Fair Automatic Recruitment [73.85525896663371]
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases.
Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.
arXiv Detail & Related papers (2020-04-15T15:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.